Instance-based Learning: Locally Linear Reconstruction and Its Applications
Pilsung Kang, Data Mining Lab., Seoul National University, 2010. 01. 05.

Table of Contents
- Introduction: Instance-based Learning
- Learning Algorithms
  - Locally Linear Reconstruction for Classification & Regression
  - Distance & Local Topology-based Hybrid Score for Novelty Detection
  - Local Topology-based Seed Initialization for Clustering
- Real-world Applications
  - Application I: Response Modeling
  - Application II: Virtual Metrology
  - Application III: Keystroke Dynamics Analysis
- Conclusion

Introduction: What is learning?
- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell)

Introduction: Data for machine learning
- Data: digitized and (un)structured observations of real-world events that are provided to the machine so that it can learn.
- Data are not only required to build learning models; they also determine the type of learning task.
- Unsupervised learning: explores the intrinsic characteristics of the data and estimates the underlying distribution. Examples: density estimation, clustering, novelty detection.
- Supervised learning: finds the relation between X and Y and estimates the underlying function y = f(x). Examples: classification, regression.

Introduction: Instance-based learning
- Instance-based learning (IBL) is also called memory-based reasoning (MBR) or lazy learning.
- It is a non-parametric approach in which training does not take place until a new query is made.
- k-nearest neighbor (k-NN) is the most popular IBL method, and it covers most learning tasks: density estimation, novelty detection, classification, and regression.

Introduction: k-NN density estimation
- Let $R$ be a small region containing $x$ and let $P = \int_R p(x')\,dx'$.
- By the binomial law, the probability that $k$ of the $N$ instances fall within $R$ is $\binom{N}{k} P^k (1-P)^{N-k}$.
- For large $N$, $k \approx NP$.
- If $R$ is sufficiently small, $P \approx p(x)V$, where $V$ is the volume of $R$.
- The estimated density becomes $p(x) \approx \dfrac{k}{NV}$.

Introduction: k-NN novelty detection
- Use distance information as a novelty score: maximum distance, average distance, distance to the mean vector, etc.
- Distance to the nearest neighbor: $d_{1\text{-}NN}(x_{n+1}) = \lVert x_{n+1} - x^{1}_{n+1} \rVert$.
- Distance to the k-th neighbor: $d^{k}_{max}(x_{n+1}) = \lVert x_{n+1} - x^{k}_{n+1} \rVert$.
- Average distance to the k neighbors: $d^{k}_{avg}(x_{n+1}) = \frac{1}{k}\sum_{i=1}^{k} \lVert x_{n+1} - x^{i}_{n+1} \rVert$.

Introduction: k-NN classification and regression
- k-NN classification predicts the class of a query by a weighted vote over its k nearest neighbors, and k-NN regression predicts by a weighted average of the neighbors' targets; in both cases a weight is assigned to each neighbor $x_i$.
- (Figure: k-NN classification examples.)

Introduction: Strengths of k-NN learning
- Sound theoretical background: the error rate of 1-NN is bounded by twice the Bayes error rate, and the error rate of k-NN converges to the Bayes error rate as k increases, provided a sufficient number of reference instances is available.
- Fast training compared with complex learning algorithms such as neural networks or support vector machines.
- Fits naturally with on-line (incremental) learning.
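The weighted k-NN prediction rule described above can be written compactly. The sketch below is illustrative only: the inverse-distance kernel is just one common choice (the specific kernels on the slides were not recoverable), and the function name and interface are assumptions.

```python
import numpy as np

def weighted_knn_predict(x_new, X_ref, y_ref, k=5, kernel=lambda d: 1.0 / (d + 1e-12)):
    """Weighted k-NN sketch: each neighbor receives a kernel weight that decreases
    with its distance; classification takes a weighted vote, regression a
    weighted average of the neighbors' targets."""
    y_ref = np.asarray(y_ref)
    dist = np.linalg.norm(X_ref - x_new, axis=1)   # distances to all reference instances
    nn = np.argsort(dist)[:k]                      # indices of the k nearest neighbors
    w = np.array([kernel(d) for d in dist[nn]])
    w = w / w.sum()                                # normalize the weights

    if np.issubdtype(y_ref.dtype, np.floating):    # regression: weighted mean of targets
        return float(w @ y_ref[nn])
    labels = np.unique(y_ref[nn])                  # classification: weighted vote
    return labels[np.argmax([w[y_ref[nn] == c].sum() for c in labels])]
```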
Introduction: Application areas of k-NN learning
- Unsupervised learning
  - Density estimation: data restoration.
  - Clustering: image processing, text categorization.
  - Novelty detection: intrusion detection, identity verification, user authentication, medical diagnosis.
- Supervised learning
  - Classification: collaborative filtering, image processing, text mining, bioinformatics.
  - Regression: manufacturing, time-series analysis.

Introduction: Limitations of k-NN learning
- Supervised learning (classification and regression): parameter dependency. How many nearest neighbors should be considered, and how should those neighbors be weighted?
- Novelty detection: counter examples exist; conventional nearest-neighbor-based novelty detectors can conflict with intuition.
- Clustering: most seed initialization techniques are purely heuristic.

Introduction: Contributions (learning algorithms)
- A systematic weight allocation method, locally linear reconstruction (LLR), is proposed for classification and regression. LLR is able to identify the neighbors that are important for the prediction and to assign appropriate weights to them.
- A distance & local topology-based hybrid score is proposed for novelty detection. The hybrid score combines two measures, one associated with distance and the other with the local topology (relative similarity) of the neighborhood. The hybrid novelty score is able to overcome the limitations of conventional nearest-neighbor-based novelty detectors.
- A new seed initialization algorithm based on centrality, sparsity, and isotropy (CSI) is proposed for clustering. Three properties associated with inter- and intra-cluster variance are identified, and relative similarity and local topology are used to measure them. CSI is able to lead the K-Means clustering algorithm to an optimal clustering structure rapidly.

Introduction: Contributions (real-world applications)
- LLR classification and CSI are employed for response modeling, i.e., predicting whether each customer will respond to a given marketing campaign. Class imbalance, a common and significant problem in response modeling, is alleviated and the response rate predictive accuracy is improved.
- LLR regression is employed for virtual metrology, i.e., predicting metrological values from sensor data and other relevant information in semiconductor manufacturing. A good prediction model should be robust to its parameters while keeping prediction accuracy as high as possible; both goals are achieved by LLR regression.
- The distance & local topology-based hybrid score is employed for keystroke dynamics-based user authentication (KDA), i.e., authenticating users based on their keyboard typing behavior. KDA should be formulated as a novelty detection problem, and the authenticator should work well in incremental environments. The hybrid score results in outstanding authentication performance.
Locally Linear Reconstruction: Classification and regression
- Recap: k-NN classification and k-NN regression both combine the k nearest neighbors' labels or targets, with a weight assigned to each neighbor $x_i$.

Locally Linear Reconstruction: Issues and current solutions
- How many nearest neighbors should be considered? Usually determined empirically by cross-validation; in some real-world cases decided by domain experts.
- How should those neighbors be weighted? "A farther neighbor gets a smaller weight": kernel functions, which decrease in proportion to the dissimilarity, are commonly used.

Locally Linear Reconstruction: Algorithm
- (Figure: an illustration of the LLR algorithm procedure.)

Locally Linear Reconstruction: Algorithm for classification
- Step 1: Compute the distances and find the k nearest neighbors.
- Step 2: Minimize the reconstruction error to find the critical neighbors and their corresponding weights.
- Step 3: Make a prediction based on the assigned weights.

Proposition 1: LLR for classification
- The optimal weight $\mathbf{w}$ is determined by minimizing the reconstruction error
  $\min_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{2}\Big\lVert x_{n+1} - \sum_{j=1}^{k} w_j x^{j}_{n+1} \Big\rVert^2 = \frac{1}{2}\big(x_{n+1} - X^{NN}_{n+1}\mathbf{w}\big)^{T}\big(x_{n+1} - X^{NN}_{n+1}\mathbf{w}\big)$,
  subject to two constraints: $w_j \ge 0 \; \forall j$ and $\sum_j w_j = 1$.
- This minimization problem can be solved by any algorithm developed for quadratic programming (QP).

Proposition 2: Computational complexity of LLR classification
- Let n be the number of reference instances and k the number of nearest neighbors. Then the computational complexity of conventional k-NN is O(n log n), of LLR with a standard QP solver is O(n log n + k^3), and of LLR with SMO is between O(n log n + k) and O(n log n + k^2).
- Proof. Conventional k-NN: distance calculation O(n), sorting O(n log n), weight allocation O(k); total O(n log n). LLR with standard QP: distance calculation O(n), sorting O(n log n), weight allocation O(k^3); total O(n log n + k^3). LLR with SMO: distance calculation O(n), sorting O(n log n), weight allocation O(k) to O(k^2); total O(n log n + k) to O(n log n + k^2).
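A minimal sketch of the three LLR classification steps, using SciPy's general-purpose SLSQP solver in place of the dedicated QP/SMO solvers discussed above; the function name and interface are assumptions, not the author's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def llr_classify(x_new, X_ref, y_ref, k=10):
    """LLR classification sketch: reconstruct the query from its k nearest
    neighbors under w_j >= 0 and sum(w) = 1 (Proposition 1), then predict by
    a weighted vote over the neighbors' labels."""
    y_ref = np.asarray(y_ref)
    dist = np.linalg.norm(X_ref - x_new, axis=1)
    nn = np.argsort(dist)[:k]                        # Step 1: k nearest neighbors
    X_nn, y_nn = X_ref[nn], y_ref[nn]

    def recon_err(w):                                # Step 2: reconstruction error
        r = x_new - X_nn.T @ w
        return 0.5 * float(r @ r)

    res = minimize(recon_err, np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, None)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    w = res.x                                        # weights of the critical neighbors

    labels = np.unique(y_nn)                         # Step 3: weighted vote
    return labels[np.argmax([w[y_nn == c].sum() for c in labels])], w
```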
Proposition 3: Explicit solution for LLR regression
- The optimal weight $\mathbf{w}$ is determined by minimizing the reconstruction error
  $\min_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{2}\Big\lVert x_{n+1} - \sum_{j=1}^{k} w_j x^{j}_{n+1} \Big\rVert^2 = \frac{1}{2}\big(x_{n+1} - X^{NN}_{n+1}\mathbf{w}\big)^{T}\big(x_{n+1} - X^{NN}_{n+1}\mathbf{w}\big)$,
  without any constraint.
- Proof. Setting $\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = (X^{NN}_{n+1})^{T} X^{NN}_{n+1}\mathbf{w} - (X^{NN}_{n+1})^{T} x_{n+1} = 0$ gives
  $\mathbf{w} = \big((X^{NN}_{n+1})^{T} X^{NN}_{n+1}\big)^{-1}(X^{NN}_{n+1})^{T} x_{n+1}$,
  which is then rescaled so that $\sum_j w_j = 1$.

Proposition 4: Linear equations for LLR regression
- The optimal weight $\mathbf{w}$ is determined by minimizing the same reconstruction error with one constraint, $\mathbf{w}^{T}\mathbf{1} = 1$.
- Proof. The problem can be rewritten as
  $\min_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{2}\big[(x_{n+1}\mathbf{1}^{T} - X^{NN}_{n+1})\mathbf{w}\big]^{T}\big[(x_{n+1}\mathbf{1}^{T} - X^{NN}_{n+1})\mathbf{w}\big] = \frac{1}{2}\mathbf{w}^{T} C_L \mathbf{w}, \quad \text{s.t. } \mathbf{w}^{T}\mathbf{1} = 1,$
  where $C_L = (x_{n+1}\mathbf{1}^{T} - X^{NN}_{n+1})^{T}(x_{n+1}\mathbf{1}^{T} - X^{NN}_{n+1})$ is the local Gram matrix.
- Proof (cont.). The primal Lagrangian of the problem is
  $L = \frac{1}{2}\mathbf{w}^{T} C_L \mathbf{w} - \lambda\big(\mathbf{w}^{T}\mathbf{1} - 1\big)$.
  The Karush-Kuhn-Tucker condition for the optimal solution is
  $\dfrac{\partial L}{\partial \mathbf{w}} = C_L\mathbf{w} - \lambda\mathbf{1} = 0 \;\Rightarrow\; C_L\mathbf{w} = \lambda\mathbf{1}$.
  The solution can therefore be obtained by solving the linear system $C_L\mathbf{w} = \mathbf{1}$ and rescaling $\mathbf{w}$ so that $\mathbf{w}^{T}\mathbf{1} = 1$.

Proposition 5: Computational complexity of LLR regression
- Let n be the number of reference instances and k the number of nearest neighbors. Then the computational complexity of LLR regression is O(n log n + k^2.376).
- Proof. Distance calculation O(n), sorting O(n log n), matrix factorization O(k^2.376); total O(n log n + k^2.376).
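A minimal sketch of the Proposition 4 solution: build the local Gram matrix, solve the linear system, and rescale the weights to sum to one. The small ridge term is an assumption added only for numerical stability; the function name and interface are likewise assumptions.

```python
import numpy as np

def llr_regression(x_new, X_ref, y_ref, k=10, reg=1e-6):
    """LLR regression sketch: weights from the sum-to-one reconstruction
    problem (Proposition 4), obtained by solving C_L w = 1 and rescaling."""
    y_ref = np.asarray(y_ref)
    dist = np.linalg.norm(X_ref - x_new, axis=1)
    nn = np.argsort(dist)[:k]
    X_nn, y_nn = X_ref[nn], y_ref[nn]

    Z = x_new - X_nn                       # k x d matrix of offsets (x - x_i)
    C = Z @ Z.T                            # local Gram matrix C_L
    C += reg * np.trace(C) * np.eye(k)     # small ridge for stability (assumption)

    w = np.linalg.solve(C, np.ones(k))     # KKT condition: C_L w proportional to 1
    w /= w.sum()                           # rescale so that sum(w) = 1

    return float(w @ y_nn)                 # weighted average of the neighbors' targets
```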
Locally Linear Reconstruction: Performance evaluation
- Data sets: classification and regression benchmark data sets (tables).
- Experimental setting: benchmark kernel functions, the values of k used for classification and regression, and performance measures: class accuracy (classification), RMSE and MAPE (regression).

Locally Linear Reconstruction: Classification performance
- Classification performance (class accuracy) (table).
- Classification performance w.r.t. various k (figures).
- The average number of important neighbors (neighbors with non-zero weights) increases on a log scale with k and then remains stable beyond a certain level.
- Execution time of LLR classification: as the number of reference instances increases, the computation time for the QP becomes negligible compared to the sorting time (figure).

Locally Linear Reconstruction: Regression performance
- Regression performance (RMSE) (table).
- Regression performance (MAPE) (table).
- Regression performance (RMSE and MAPE) w.r.t. various k (figures).
- Execution time of LLR regression (figure).

Locally Linear Reconstruction: Summary
- A local topology-based optimization problem is formulated; LLR is able to find the neighbors that are important for the prediction and to assign appropriate weights to them.
- Performance evaluation: LLR outperformed conventional weight allocation methods for both classification and regression, was found to be robust to the number of nearest neighbors (k), and its additional computational burden was not significant.

Novelty Detection: Definition
- What is a novel instance?
  - "Observations that deviate so much from other observations as to arouse suspicions that they were generated by a different mechanism" (Hawkins, 1980).
  - "Instances whose true probability density is very low" (Harmeling et al., 2006).
- Binary classification vs. novelty detection (Lee, 2007) (figure).

Novelty Detection: Approaches
- Properties for the success of novelty detection:
  - Flexibility: the ability to generate an arbitrary shape of description boundary.
  - Simplicity: a small number of model parameters.
  - Updatability: the ability to update the model with new instances.
  - Stability: low sensitivity to the initial conditions of model learning.
- Properties of various novelty detection algorithms (table): nearest-neighbor-based novelty detectors have many positive properties.

Novelty Detection: Nearest-neighbor-based approaches
- Maximum distance (Ramaswamy et al., 2000): distance to the k-th nearest neighbor, $d^{k}_{max}(x_{n+1}) = \lVert x_{n+1} - x^{k}_{n+1} \rVert$.
- Average distance (Angiulli and Pizzuti, 2005): average distance to the k nearest neighbors, $d^{k}_{avg}(x_{n+1}) = \frac{1}{k}\sum_{i=1}^{k}\lVert x_{n+1} - x^{i}_{n+1} \rVert$.
- Distance to the mean (Harmeling et al., 2006): distance to the mean vector of the k nearest neighbors, $d^{k}_{mean}(x_{n+1}) = \big\lVert x_{n+1} - \frac{1}{k}\sum_{i=1}^{k} x^{i}_{n+1} \big\rVert$.
- (Figure: the effect of nearest-neighbor-based novelty detectors.)
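The three distance-based scores above take only a few lines of NumPy; the helper below is a hypothetical illustration, assuming Euclidean distance.

```python
import numpy as np

def knn_novelty_scores(x_new, X_ref, k=5):
    """Distance-based novelty scores (higher means more novel)."""
    dist = np.linalg.norm(X_ref - x_new, axis=1)
    nn = np.argsort(dist)[:k]                   # indices of the k nearest neighbors
    d_max = dist[nn[-1]]                        # distance to the k-th nearest neighbor
    d_avg = dist[nn].mean()                     # average distance to the k neighbors
    d_mean = np.linalg.norm(x_new - X_ref[nn].mean(axis=0))   # distance to their mean
    return d_max, d_avg, d_mean
```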
Novelty Detection: Counter examples
Which one should be identified as novel?

| Data set | Instance | d^k_max | d^k_avg | d^k_mean |
|---|---|---|---|---|
| A (k=4) | Circle | 1.58 | 1.14 | 0.50 |
| A (k=4) | Triangle | 1.64 | 1.07 | 0.94 |
| B (k=5) | Circle | 1.56 | 1.08 | 0.80 |
| B (k=5) | Triangle | 1.86 | 1.09 | 0.88 |

Distance & Local Topology-based Hybrid Novelty Score
- Step 1: Compute the distances and find the k nearest neighbors.
- Step 2 (absolute measure): Compute the average distance to the k nearest neighbors.
- Step 3 (relative measure): Compute the distance to the convex hull constituted by the neighbors.
- Step 4 (combine the two measures): Compute the hybrid score by combining the absolute measure and the relative measure.

The Hybrid Novelty Score: Counter examples revisited

| Data set | Instance | d^k_max | d^k_avg | d^k_mean | d^k_hybrid |
|---|---|---|---|---|---|
| A (k=4) | Circle | 1.58 | 1.14 | 0.50 | 1.42 |
| A (k=4) | Triangle | 1.64 | 1.07 | 0.94 | 1.18 |
| B (k=5) | Circle | 1.56 | 1.08 | 0.80 | 1.18 |
| B (k=5) | Triangle | 1.86 | 1.09 | 0.88 | 1.09 |

The Hybrid Novelty Score: An illustrative example (figure).

The Hybrid Novelty Score: Performance evaluation
- Data sets (table).
- The data sets are grouped in terms of:
  - The number of normal instances (TrNn): small (fewer than 200 reference instances) vs. large (more than 200 reference instances).
  - The number of attributes (Dim.): low (attributes < reference instances) vs. high (attributes > reference instances).
- Performance measure: integrated error, which is robust to an arbitrary threshold setting.
- Benchmark novelty detectors (13 in total):
  - Three density-based: Gaussian density estimation (Gauss), mixture of Gaussians (MoG), Parzen window density estimator (Parzen).
  - One support vector-based: one-class support vector machine (1-SVM).
  - Three clustering-based: K-Means clustering (KMC), K-Center clustering (KCC), average linkage-based hierarchical clustering (HC).
  - One dimensionality reduction-based: principal component analysis (PCA).
  - Five distance-based: max distance (dmax), average distance (davg), distance to the mean vector (dmean), 1-nearest neighbor (1-NN), minimum spanning tree (MST).

The Hybrid Novelty Score: Novelty detection performance
- In low dimensions (Group A and Group B data sets), the proposed hybrid score (d^k_hybrid) was outstanding: best for eight data sets out of ten, followed by MST-CD and HC.
- In high dimensions (Group B and Group C data sets), d^k_hybrid and MST-CD were superior to the other novelty detectors, best for four and three data sets out of 11, respectively. When dimensionality is high, local topology becomes more important.
- In common: Gauss and PCA were generally inferior to the other novelty detectors; they are not able to produce an arbitrary shape of class boundary.
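A sketch of the hybrid score described above: the absolute measure is the average distance to the k nearest neighbors, and the relative measure is the distance to their convex hull, computed from the same constrained reconstruction problem used by LLR. The exact combining function is not recoverable from the extracted slides, so the sigmoid-weighted product below is only an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def hybrid_novelty_score(x_new, X_ref, k=5):
    """Hybrid score sketch: average distance (absolute measure) combined with
    the distance to the neighbors' convex hull (relative, local-topology measure)."""
    dist = np.linalg.norm(X_ref - x_new, axis=1)
    nn = np.argsort(dist)[:k]
    X_nn = X_ref[nn]

    d_avg = dist[nn].mean()                       # Step 2: absolute measure

    def err(w):                                   # Step 3: distance to the convex hull
        r = x_new - X_nn.T @ w                    # via the constrained reconstruction
        return 0.5 * float(r @ r)                 # error (w >= 0, sum(w) = 1)

    res = minimize(err, np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, None)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    d_hull = float(np.linalg.norm(x_new - X_nn.T @ res.x))

    # Step 4: combine the two measures; the sigmoid scaling below is an assumption.
    return d_avg * 2.0 / (1.0 + np.exp(-d_hull))
```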
The Hybrid Novelty Score: Novelty detection performance
- Execution time of all novelty detectors (figure).

The Hybrid Novelty Score: Summary
- A hybrid novelty score combining distance and local topology is proposed: the absolute measure is the average distance to the k nearest neighbors, and the relative measure is the distance to the convex hull formed by the k nearest neighbors. It is able to overcome the limitations of conventional nearest-neighbor-based novelty detectors.
- Performance evaluation: the hybrid score improved on conventional nearest-neighbor-based novelty detectors, outperformed other state-of-the-art novelty detectors in most cases, and kept the computational complexity low.

Clustering: Overview
- Clustering is a data analysis tool that partitions the entire data set into a number of meaningful subsets or groups, called clusters.
- A good clustering algorithm results in a clustering structure in which the set of clusters is heterogeneous and each cluster is homogeneous.

Clustering: K-Means clustering
- By far the most widely used clustering algorithm; it finds K clusters by minimizing the within-cluster sum of squared error.
- Benefits: works well with any Lp norm, allows straightforward parallelization, and does not depend on data ordering.
- Limitation: the clustering structure relies on the choice of the initial seeds.

Clustering: K-Means clustering seed initialization
Seed initialization approaches:

| Method | Source | Description | Limitation |
|---|---|---|---|
| R-Mean | He et al. (2002) | Simply adds Gaussian noise to the mean vector | Randomness |
| SCS | Tou and Gonzalez (1974) | Considers absolute distances among seeds | Randomness |
| KKZ | Katsavounidis et al. (1994) | Considers relative distances among seeds | Does not consider sparsity |
| KR | Kaufman and Rousseeuw (1990) | Considers distances from seeds to other instances | Heavy computational cost |
| CCIA | Khan and Ahmad (2004) | Uses attribute information | Order dependency |
| Kd-tree | Redmond and Heneghan (2007) | Uses density information | Parameter sensitivity |

Clustering: K-Means clustering seed initialization
Three properties required for good seeds:
- Centrality: an initial seed should be located in the middle of a cloud of instances; this leads to quick convergence.
- Sparsity: any pair of seeds should be separated by a sparse region; this helps find an optimal clustering structure.
- Isotropy: seeds should be located far from each other; this also helps find an optimal clustering structure.
(Figure: examples that lack some of these properties.)

CSI: Algorithm
- Step 1 (Centrality): Set the instances with zero $d_{c\text{-}hull}$ (distance to the convex hull of their nearest neighbors) as the seed candidates.
  (Illustration: an instance with $d^{4}_{c\text{-}hull} = 0$ vs. one with $d^{4}_{c\text{-}hull} > 0$.)
- Step 2 (Isotropy): Compute the geometric mean of the distances between a candidate and the existing seeds. (Illustration: high isotropy vs. low isotropy.)
- Step 3 (Sparsity): Compute the minimum radius of the empty ball. (Illustration: radii $d_r$ and $d_R$.)
- Step 4 (Seed score): Compute the seed score based on sparsity and isotropy.
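A much-simplified sketch of Steps 1 and 2 only: the sparsity measure and the exact seed score are not recoverable from the extracted slides and are therefore omitted, and the convex-hull distance helper is a hypothetical implementation reusing the LLR-style constrained reconstruction.

```python
import numpy as np
from scipy.optimize import minimize

def chull_distance(x, X_nn):
    """Distance from x to the convex hull of the rows of X_nn (small QP)."""
    k = len(X_nn)
    err = lambda w: 0.5 * float(np.sum((x - X_nn.T @ w) ** 2))
    res = minimize(err, np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, None)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return float(np.linalg.norm(x - X_nn.T @ res.x))

def csi_candidates(X, seeds, k=5):
    """Step 1: candidates are instances lying inside the convex hull of their
    k nearest neighbors (d_c-hull = 0). Step 2: each candidate is scored by the
    geometric mean of its distances to the already-chosen seeds (isotropy).
    `seeds` must be a non-empty array of previously chosen seeds."""
    candidates, isotropy = [], []
    for i, x in enumerate(X):
        nn = np.argsort(np.linalg.norm(X - x, axis=1))[1:k + 1]   # exclude x itself
        if chull_distance(x, X[nn]) < 1e-8:                       # centrality screen
            d = np.linalg.norm(np.asarray(seeds) - x, axis=1)
            candidates.append(i)
            isotropy.append(float(np.exp(np.mean(np.log(d + 1e-12)))))
    return candidates, isotropy
```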
CSI: Performance evaluation
- Benchmark methods: R-MEAN (He et al., 2002), SCS (Tou and Gonzalez, 1974), KKZ (Katsavounidis et al., 1994), KR (Kaufman and Rousseeuw, 1990), CCIA (Khan and Ahmad, 2004), Kd-tree (Redmond and Heneghan, 2007).
- Data sets:

| Data | N Inst. | N Att. | Classes |
|---|---|---|---|
| Synthetic | 300 | 2 | 4 |
| Iris | 150 | 4 | 3 |
| Abalone | 1,854 | 8 | 5 |
| Handdigits | 2,000 | 76 | 10 |
| Segmentation | 2,100 | 19 | 7 |
| Letter | 3,878 | 16 | 5 |
| Satellite | 4,435 | 36 | 5 |

- Performance measures: sum of squared error (SSE), SD validity index, class accuracy.

CSI: Clustering results (sum of squared error, SSE)
The lower, the better. CSI performed best for six out of seven data sets; KR, CCIA, and kd-tree each performed best for two data sets.

| Data set | K | R-SEL | RMEAN | SCS | KKZ | KR | CCIA | Kd-tree | CSI |
|---|---|---|---|---|---|---|---|---|---|
| Synthetic | 4 | 1,250 | 1,239 | 1,204 | 1,249 | 1,240 | 1,243 | 1,217 | 1,154 |
| Iris | 3 | 172 | 161 | 161 | 140 | 140 | 140 | 140 | 140 |
| Abalone | 5 | 3,580 | 3,517 | 3,660 | 3,660 | 3,671 | 3,640 | 3,614 | 3,409 |
| Handdigits | 10 | 108,145 | 108,218 | 108,272 | 108,541 | 107,056 | 107,086 | 107,048 | 107,135 |
| Segmentation | 7 | 14,131 | 13,177 | 15,013 | 19,031 | 12,561 | 11,841 | 18,602 | 11,841 |
| Letter | 5 | 35,922 | 36,426 | 35,776 | 36,312 | 34,764 | 34,841 | 36,012 | 34,744 |
| Satellite | 6 | 37,352 | 34,136 | 40,927 | 39,899 | 34,136 | 38,477 | 37,568 | 34,136 |

(Chart: relative SSE by initialization method for each data set.)

CSI: Clustering results (SD validity index)
The lower, the better. CSI performed best for six out of seven data sets; CCIA and kd-tree each performed best for two data sets.

| Data set | K | R-SEL | RMEAN | SCS | KKZ | KR | CCIA | Kd-tree | CSI |
|---|---|---|---|---|---|---|---|---|---|
| Synthetic | 4 | 0.181 | 0.175 | 0.159 | 0.203 | 0.197 | 0.187 | 0.145 | 0.116 |
| Iris | 3 | 0.561 | 0.557 | 0.557 | 0.550 | 0.550 | 0.550 | 0.550 | 0.550 |
| Abalone | 5 | 0.545 | 0.536 | 14.290 | 14.290 | 14.238 | 0.544 | 0.539 | 0.518 |
| Handdigits | 10 | 0.839 | 0.837 | 0.841 | 0.847 | 0.733 | 0.821 | 0.742 | 0.732 |
| Segmentation | 7 | 3.152 | 2.968 | 3.831 | 15.167 | 2.702 | 2.700 | 2.754 | 2.741 |
| Letter | 5 | 0.714 | 0.719 | 0.703 | 0.724 | 0.668 | 0.701 | 0.684 | 0.659 |
| Satellite | 6 | 0.319 | 0.282 | 0.369 | 0.350 | 0.284 | 0.291 | 0.282 | 0.282 |

(Chart: relative SD validity index by initialization method for each data set.)

CSI: Clustering results (class accuracy)
The higher, the better. CSI performed best for six out of seven data sets; KKZ, KR, CCIA, and kd-tree each performed best for two data sets.

| Data set | K | R-SEL | RMEAN | SCS | KKZ | KR | CCIA | Kd-tree | CSI |
|---|---|---|---|---|---|---|---|---|---|
| Synthetic | 4 | 0.691 | 0.690 | 0.690 | 0.691 | 0.690 | 0.690 | 0.690 | 0.690 |
| Iris | 3 | 0.730 | 0.764 | 0.764 | 0.833 | 0.833 | 0.833 | 0.833 | 0.833 |
| Abalone | 5 | 0.468 | 0.485 | 0.502 | 0.502 | 0.447 | 0.500 | 0.503 | 0.513 |
| Handdigits | 10 | 0.508 | 0.505 | 0.515 | 0.474 | 0.533 | 0.541 | 0.528 | 0.563 |
| Segmentation | 7 | 0.584 | 0.562 | 0.415 | 0.302 | 0.592 | 0.589 | 0.596 | 0.608 |
| Letter | 5 | 0.546 | 0.518 | 0.554 | 0.510 | 0.615 | 0.618 | 0.604 | 0.624 |
| Satellite | 6 | 0.705 | 0.746 | 0.626 | 0.647 | 0.746 | 0.746 | 0.746 | 0.746 |

(Chart: relative class accuracy by initialization method for each data set.)

CSI: Clustering results (computational time)
- RMEAN was the fastest, while KR was the slowest. CSI was comparable with SCS, KKZ, CCIA, and kd-tree. (Chart: computation time per data set, log scale.)

CSI: Clustering results (clustering iterations)
- KR and CSI generally converged faster than the others; the convergence of R-Sel, SCS, and KKZ was slower. (Chart: number of K-Means iterations per data set.)

CSI: Summary
- A new seed initialization algorithm (CSI) for K-Means clustering is proposed. Three properties (centrality, sparsity, isotropy) are identified and accommodated, and similarity and local topology are taken into account.
- Performance evaluation: CSI is able to find the optimal clustering structure and leads to quick convergence.

Application I: Response modeling
- Response modeling identifies the customers who are likely to purchase a product, based on their purchase history and other information: demographic information (age, sex, job, etc.) and behavioral information (recency, frequency, monetary value, etc.).
- Firms attempt to induce high-potential buyers to purchase the campaigned product through their communication channels, e.g., phone, catalog, or e-mail.
- A well-developed response model can increase total revenue and lower total marketing cost.
- Increasing the response rate is not an easy task, but its impact is considerable. Even a small increase in response rate can:
  - Change the outcome of a direct mailing campaign from failure to success (Baesens et al., 2002).
  - Boost total revenue and revenue per respondent significantly (Knott et al., 2002).
  - Not only increase profit but also strengthen customer loyalty (Sun et al., 2006).
Response modeling: Approaches
- Statistics: logistic regression (Aaker et al., 2001; Hosmer and Lemeshow, 1989), stochastic RFM (Colombo and Jiang, 1999), hazard function (Gonul et al., 2000).
- Pattern recognition and data mining: artificial neural networks (Baesens et al., 2002; Kaefer et al., 2005), bagging artificial neural networks (Ha et al., 2005), Bayesian neural networks (Baesens et al., 2002), support vector machines (Shin and Cho, 2006), decision trees (Coenen et al., 2000).

Response modeling: Class imbalance
- Non-respondents overwhelmingly outnumber respondents: 9.4% of customers are respondents in the DMEF4 data set (Shin and Cho, 2006), and only 6% are respondents in the CoIL Challenge 2000 data set (Putten et al., 2000). Response rates in general direct marketing situations are often much lower.
- Approaches to dealing with class imbalance: algorithm modification (cost differentiation, boundary alignment) and data balancing (under-sampling, over-sampling). Data balancing methods are more universal in that they can be combined with any prediction model.

Response modeling: Data balancing methods
- Under-sampling-based methods reduce the number of majority class instances while keeping all the minority class instances. They are effective in reducing training time but often distort the class distribution (sampling bias). Examples: random under-sampling, SHRINK (Kubat et al., 1997), one-sided selection (OSS) (Kubat and Matwin, 1997).
- Over-sampling-based methods increase the number of minority class instances while keeping all the majority class instances. They preserve the original data distribution but increase training time. Examples: random over-sampling, SMOTE (Chawla et al., 2002), SMOTEBoost (Chawla et al., 2003).

Response modeling: Proposal
- A new data balancing method based on clustering, under-sampling, and ensemble (CUE).
- Clustering-based under-sampling eliminates the sampling bias and reduces the performance variation; K-Means clustering with CSI seed initialization is employed.
- An ensemble-based prediction model boosts the response rate predictive accuracy; LLR classification is employed.

Response modeling: CUE procedure
- Step 1: Separate the customer data into respondents and non-respondents.
- Step 2: Divide the non-respondents into clusters using clustering.
- Step 3: Construct multiple training sets by combining the respondents with the non-respondents sampled from each cluster.
- Step 4: Train one prediction model on each training set.
- Step 5: Make a prediction ("Is he/she going to respond?") by aggregating the individual prediction results.
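A sketch of the training-set construction in Steps 1 to 3. It is illustrative only: scikit-learn's k-means++ stands in for CSI seed initialization, and the 1:1 sampling ratio per training set is an assumption not taken from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def cue_training_sets(X_resp, X_non, n_clusters=5, random_state=0):
    """CUE sketch: cluster the non-respondents, under-sample each cluster, and
    pair each sample with all respondents to form one training set per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_non)
    rng = np.random.default_rng(random_state)

    training_sets = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        pick = rng.choice(members, size=min(len(members), len(X_resp)), replace=False)
        X = np.vstack([X_resp, X_non[pick]])
        y = np.concatenate([np.ones(len(X_resp)), np.zeros(len(pick))])
        training_sets.append((X, y))
    # Train one classifier (e.g., LLR-based k-NN) per training set, then aggregate
    # the individual predictions, e.g., by majority vote or averaged scores.
    return training_sets
```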
Response modeling: Implementation
- Data sets:
  - CoIL Challenge 2000: provided by the Dutch data mining company Sentient Machine Research for a data mining competition. The task is to predict which customers are potentially interested in a caravan insurance policy. 85 explanatory variables: 42 product usage variables and 43 socio-demographic variables. Training set: 348 (5.98%) respondents out of 5,822 customers; test set: 238 (5.95%) respondents out of 4,000 customers.
  - DMEF4: provided by the Direct Marketing Educational Foundation for research purposes. The task is to discover the customers who purchased the product in the test period, based on demographic and historical purchase information from the reference period. 15 variables are selected from the 91 explanatory variables; 101,532 customers, of whom 9.4% are respondents.
- Benchmark data balancing methods: no sampling (NS); under-sampling: random under-sampling (RUS), one-sided selection (OSS); over-sampling: random over-sampling (ROS), synthetic minority over-sampling technique (SMOTE).
- Classification models: logistic regression (LR), multi-layer perceptron (MLP), k-nearest neighbor classification with locally linear reconstruction (k-NN), support vector machine (SVM).
- Performance measure: balanced correction rate (BCR), the geometric mean of the majority class accuracy and the minority class accuracy.

Response modeling: CoIL Challenge 2000 results
- Prediction accuracy (table).
- BCR improvement and variation reduction of CUE over RUS, ROS, and SMOTE (%): BCR improved the most with MLP, and the reduction in performance variation was significant.
- True response rate (TRR) vs. true non-response rate (TNR) (figure).
- Lift charts (figure).

Response modeling: DMEF4 results
- Prediction accuracy (table).
- BCR improvement and variation reduction of CUE over RUS, ROS, and SMOTE (%): BCR improved the most with MLP, and the reduction in performance variation was significant for MLP and k-NN.
- True response rate (TRR) vs. true non-response rate (TNR) (figure).
Response modeling: DMEF4 results (cont.)
- Lift charts (figure).
- Total profit with various marketing costs (figure).
- Total revenues and costs with various marketing costs, for LR and MLP and for k-NN and SVM (figures).

Response modeling: Summary
- CUE was proposed to deal with class imbalance: a new data balancing method based on clustering, under-sampling, and an ensemble, designed to boost response rate predictive accuracy and lower performance variation. The CSI algorithm was used in the clustering step, and LLR classification in the prediction step.
- Performance evaluation: CUE improved prediction accuracy while keeping the variance low, and LLR classification was the most accurate and profitable prediction model.

Application II: Virtual metrology
- Virtual metrology predicts, rather than actually measures, metrological values using sensor data from production equipment and the actual metrological values of sampled wafers.
- A well-developed virtual metrology system can enhance the final yield by managing scrapped wafers appropriately, enable predictive maintenance based on real-time forecasts of metrological data, detect process drifts promptly, enable run-to-run (R2R) process control, and reduce the cost and time required for actual metrology.

Application II: Practical issues and proposal
- A large number of input variables: a very large number of sensor parameters is recorded, leading to the curse of dimensionality. This is addressed with dimensionality reduction techniques.
- A limited number of available wafers: only a few wafers are actually measured, so there are not enough training instances. This is addressed by employing an instance-based learning algorithm, k-NN regression with LLR.
- Non-stationary process: process environments may change, so frequent model updates are mandatory. This is addressed by adopting a prediction algorithm suitable for incremental learning, again k-NN regression with LLR.

Virtual Metrology: Process
- Overlay in photolithography: the lateral positioning between the layers comprising integrated circuits. In the ideal situation the layers (1st to 4th) are perfectly aligned; in practice, overlay misalignment occurs (figure).

Virtual Metrology: Data
- Collected from two chucks over eight months.
- 1,612 wafers from chuck 1 and 1,563 wafers from chuck 2.
- Sensor parameters: 37 sensor parameters, each summarized by 4 statistics (mean, standard deviation, max, min), giving a total of 148 input variables.
- Target metrology variables: eight variables regarding overlay misalignment; four of them (Y1, Y2, Y3, Y4) have more impact on productivity than the others.

Virtual Metrology: Model update
- A moving window scheme is used: a model is built on a window of preceding months (e.g., Jan. to Mar.), tested on the following month (Apr.), and the window then slides forward month by month through Aug.
- The number of wafers used in each period is summarized for chuck 1 and chuck 2 (table).

Virtual Metrology: Dimensionality reduction
- Variable selection (select a subset of important input variables): stepwise selection with linear regression (stepwise LR), genetic algorithm with linear regression (GA-LR), genetic algorithm with support vector regression (GA-SVR).
- Variable extraction (construct a reduced set of input variables by transforming the original ones): principal component analysis (PCA), kernel principal component analysis (KPCA).

Virtual Metrology: Prediction models and performance measures
- Four regression models are employed: linear regression, multi-layer perceptron (MLP), support vector regression (SVR), and k-NN regression with LLR.
- Mean squared error (MSE): how well a prediction model fits the relation between the input variables and the targets.
- Mean absolute specification error (MASE): how closely the model predicts the target with regard to its tolerance.

Virtual Metrology: Dimensionality reduction results
- Variable selection reduced the total number of variables to between 21.5% (stepwise LR) and 42.7% (GA-SVR) of the original set.
- Variable extraction: only 14 variables are needed to explain 50% of the variance of the original input data, and two-thirds of the variables explain 99% of the variance, implying that many variables are highly correlated with each other.

Virtual Metrology: Prediction results
- Best VM model for each metrology measurement: k-NN with LLR produced the best model in seven cases, followed by MLP with six cases.
- Prediction results (MSE and MASE, Y1~Y4 and Y5~Y8) (tables).
- Parameter sensitivity: MLP is very sensitive to its number of hidden nodes, whereas k-NN with LLR is robust to the number of nearest neighbors.
- Prediction example: Y1, k-NN with LLR, trained on five months of data (Mar. to Jul.) (figure).
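A generic sketch of the moving-window retraining loop described above. The window length, the monthly batch format, and the `fit`/`predict` callables are assumptions; any of the regression models above (e.g., k-NN with LLR) could be plugged in.

```python
import numpy as np

def moving_window_evaluation(batches, fit, predict, window=3):
    """Moving-window sketch: train on the `window` most recent monthly batches,
    test on the next month, then slide the window forward by one month.
    `batches` is a list of (X, y) pairs ordered in time."""
    errors = []
    for t in range(window, len(batches)):
        X_tr = np.vstack([batches[i][0] for i in range(t - window, t)])
        y_tr = np.concatenate([batches[i][1] for i in range(t - window, t)])
        X_te, y_te = batches[t]
        model = fit(X_tr, y_tr)                               # rebuild the model
        errors.append(float(np.mean((predict(model, X_te) - y_te) ** 2)))  # monthly MSE
    return errors
```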
Virtual Metrology: Summary
- k-NN regression with LLR for virtual metrology: virtual metrology predicts metrological values based on available production information, and the small number of training wafers and the parameter sensitivity are handled.
- Performance evaluation: k-NN with LLR produced the best prediction model, and its parameter sensitivity is much lower than that of the other learning algorithms.

Application III: Keystroke dynamics analysis
- Keystroke dynamics: the way a person types a string of characters, captured as key durations and the intervals between keys; e.g., the timing vector [40, 50, 30, 35, 40, -15, 40] (ms) for the keys A, B, C, D.
- Password-based user authentication is the most commonly used authentication system: it is easy to develop, operate, and maintain, and is cost-efficient, but it is vulnerable when the password is leaked.
- Keystroke dynamics-based user authentication uses one's keystroke typing behavior in addition to the password, strengthening the security of the account.

Application III: Practical issues and proposal
- Data availability: only a valid user's keystroke typing patterns are available, so classification models are not appropriate; the task is formulated as a novelty detection problem.
- Concept drift: one's typing behavior can change, so frequent model updates are mandatory; the distance & local topology-based hybrid novelty score (dhybrid) is employed.

KDA: Data
- Group A data sets: collected in 1996~1998 from a SUN workstation; 21 users were involved, whose numbers of training typing patterns vary from 76 to 388; 15 impostors were recruited; 75 valid and 75 impostor test patterns per user.
- Group B data sets: collected in 2005 from the subjects' own PCs; 25 users were involved, each with 30 training typing patterns; users changed their roles; 24 valid and 24 impostor test patterns per user.
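A sketch of how a timing vector like [40, 50, 30, 35, 40, -15, 40] could be derived from key press/release events. The event format (key, press time, release time) and the interleaved ordering of durations and intervals are assumptions made only to mirror the example shown earlier.

```python
def keystroke_timing_vector(events):
    """Build a duration/interval timing vector from (key, press_ms, release_ms)
    tuples ordered by press time: each key's hold duration, interleaved with the
    interval between its release and the next key's press (negative when the next
    key is pressed before the previous one is released)."""
    durations = [release - press for _, press, release in events]
    intervals = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    pattern = []
    for i, d in enumerate(durations):
        pattern.append(d)
        if i < len(intervals):
            pattern.append(intervals[i])
    return pattern
```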
KDA: Authenticators
- Benchmark novelty detectors (13 in total):
  - Three density-based: Gaussian density estimation (Gauss), mixture of Gaussians (MoG), Parzen window density estimator (Parzen).
  - One support vector-based: one-class support vector machine (1-SVM).
  - Three clustering-based: K-Means clustering (KMC), K-Center clustering (KCC), average linkage-based hierarchical clustering (HC).
  - One dimensionality reduction-based: principal component analysis (PCA).
  - Five distance-based: max distance (dmax), average distance (davg), distance to the mean vector (dmean), 1-nearest neighbor (1-NN), minimum spanning tree (MST).

KDA: Incremental learning and performance measures
- Test patterns are evaluated individually and independently (one at a time), and the model is updated according to the test pattern's prediction result, not its actual label.
- All valid test patterns and 10 randomly selected impostor patterns were used.
- Performance measure: integrated error.

KDA: Authentication results
- Authentication performance for Group A users, and improvement over davg and 1-SVM for Group A users (tables).
- Authentication performance for Group B users, and improvement over davg and 1-SVM for Group B users (tables).
- Total computation time for each group: when the number of training patterns is large, 1-SVM, PCA, and dhybrid were efficient, and PCA and dhybrid remained efficient as the number of typing patterns increased. dhybrid was able to achieve both high detection performance and efficiency.

KDA: Summary
- The distance & local topology-based hybrid novelty score (dhybrid) is employed for keystroke dynamics-based user authentication, which utilizes one's keyboard typing behavior for authentication; only valid users' typing patterns are available, and frequent model updates are required.
- Performance evaluation: dhybrid produced the best novelty detection performance for both groups and can be adopted efficiently in incremental learning environments.

What Has Been Done
- Locally linear reconstruction (LLR) for classification and regression: a systematic weight allocation method based on local topology, able to identify the neighbors that are important for the prediction and to assign appropriate weights to them.
- The distance & local topology-based hybrid score (dhybrid) for novelty detection: takes both absolute and relative similarity into account; absolute similarity is associated with the average distance to one's neighbors, and relative similarity with the local topology among one's neighbors. It is able to overcome the limitations of conventional nearest-neighbor-based novelty detectors.
  It also outperformed other popular novelty detectors.
- A new seed initialization algorithm based on centrality, sparsity, and isotropy (CSI) for clustering: three properties associated with inter- and intra-cluster variance are identified, relative similarity and local topology are used to measure them, and CSI is able to lead the K-Means clustering algorithm to an optimal clustering structure rapidly.
- Application to response modeling: LLR classification and CSI were employed to handle class imbalance, improving the response rate and reducing performance variation.
- Application to virtual metrology: LLR regression was employed to handle parameter sensitivity and model updates; both goals were successfully achieved.
- Application to keystroke dynamics analysis: the hybrid novelty score (dhybrid) was employed to handle data availability and model updates; an efficient and accurate authenticator could be built.

What Should Be Done
- In learning theory: similarity measures, non-numeric attributes, optimizing the number of clusters, scalability, concept drift environments.
- Response modeling: combining algorithm modification with data balancing methods; uplift modeling.
- Virtual metrology: integration with R2R control systems.
- Keystroke dynamics analysis: long free-text-based authentication, account sharing.

Q&A