Support Cluster Machine
Paper from ICML 2007, read by Haiqin Yang, 2007-10-18.
The paper "Support Cluster Machine" was written by Bin Li, Mingmin Chi, Jianping Fan, and Xiangyang Xue, and published at ICML 2007.

Outline
- Background and Motivation
- Support Cluster Machine (SCM)
- Kernel in SCM
- Experiments
- An Interesting Application: Privacy-Preserving Data Mining
- Discussions

Background and Motivation
Approaches to the large-scale classification problem:
- Decomposition methods: Osuna et al., 1997; Joachims, 1999; Platt, 1999; Collobert & Bengio, 2001; Keerthi et al., 2001
- Incremental algorithms: Cauwenberghs & Poggio, 2000; Fung & Mangasarian, 2002; Laskov et al., 2006
- Parallel techniques: Collobert et al., 2001; Graf et al., 2004
- Approximate formulations: Fung & Mangasarian, 2001; Lee & Mangasarian, 2001
- Choosing representatives: active learning (Schohn & Cohn, 2003); Clustering-Based SVM (Yu et al., 2003); Core Vector Machine (CVM) (Tsang et al., 2005); Clustering SVM (Boley & Cao, 2004)

Support Cluster Machine (SCM)
- Given training samples (formulas on slide)
- Procedure (formulas on slide)

SCM Solution
- Dual representation (formula on slide)
- Decision function (formula on slide)

Kernel
- Probability product kernel
- Under the Gaussian assumption, the kernel takes a closed form (derivation on slide)

Kernel Properties
- Property I, with the resulting decision function (formulas on slide)
- Property II (formula on slide)

Experiments
- Datasets: Toydata; MNIST (handwritten digit '0'-'9' classification); Adult (privacy-preserving)
- Classification methods: libSVM, SVMTorch, SVMlight, CVM (Core Vector Machine), SCM
- Clustering algorithms: Threshold Order Dependent (TOD), EM algorithm
- Model selection
- CPU: 3.0 GHz

Toydata
- Samples: 2,500 per class, generated from a mixture of Gaussians
- Clustering algorithm: TOD
- Clustering results: 25 positive and 25 negative clusters

MNIST
- Data description: 10 classes, handwritten digits '0'-'9'
- Training samples: 60,000 (about 6,000 per class); testing samples: 10,000
- 45 binary classifiers constructed
- Results: 25 clusters for the EM algorithm; test results for the TOD algorithm (tables on slides)

Privacy-Preserving Data Mining
- Inter-enterprise data mining
- Problem: Two
parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information.
- Horizontally partitioned: records (users) are split across companies. Example: a credit card fraud detection model
- Vertically partitioned: attributes are split across companies. Example: associations across websites

Randomization Approach
- Original records (e.g., Age 30 | Salary 70K; Age 50 | Salary 40K) pass through a randomizer, producing perturbed records (e.g., 65 | 20K; 25 | 60K)
- The distributions of Age and Salary are reconstructed from the perturbed data
- Data mining algorithms then run on the reconstructed distributions to produce the model

Classification Example
Training data:
Age | Salary | Repeat Visitor?
23  | 50K    | Repeat
17  | 30K    | Repeat
43  | 40K    | Repeat
68  | 50K    | Single
32  | 70K    | Single
20  | 20K    | Repeat
Learned decision tree: if Age < 25, predict Repeat; otherwise, if Salary < 50K, predict Repeat, else predict Single.

Privacy-Preserving Dataset: Adult
- Training samples: 30,162; testing samples: 15,060; positive samples: 24.78%
- Procedure:
  1. Horizontally partition the data into three subsets (parties)
  2. Cluster each party's data with the TOD algorithm
  3. Obtain three positive and three negative GMMs
  4. Combine them into one positive and one negative GMM with modified priors
  5. Classify with SCM
- Partition results and experimental results (tables on slides)

Discussions
Solved problems:
- Large-scale problems: downsample by clustering, then classify
- Privacy-preserving problems: individual information stays hidden
Differences from other methods:
- Training units are generative models, while testing units are vectors
- Training units carry complete statistical information
- Only one parameter for model selection
- Easy implementation
- The generalization ability is not yet clear, whereas for the RBF kernel in SVM a larger width leads to a lower VC dimension

Discussions (cont.)
- Advantages of using priors and covariances (figures on slide)

Thank you!
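Appendix note on the kernel slide: for two Gaussian cluster models the probability product kernel has a well-known closed form. With exponent rho = 1 (the expected likelihood kernel), the integral of the product of two Gaussian densities is itself a Gaussian density evaluated at the difference of the means, with the covariances added. A minimal sketch; `ppk_gaussian` is an illustrative name, not from the paper:

```python
import numpy as np

def ppk_gaussian(mu1, cov1, mu2, cov2):
    """Expected-likelihood kernel (probability product kernel, rho = 1)
    between N(mu1, cov1) and N(mu2, cov2):
        K = integral N(x; mu1, cov1) N(x; mu2, cov2) dx
          = N(mu1; mu2, cov1 + cov2)
    """
    mu1 = np.atleast_1d(np.asarray(mu1, float))
    mu2 = np.atleast_1d(np.asarray(mu2, float))
    cov = np.atleast_2d(np.asarray(cov1, float)) + np.atleast_2d(np.asarray(cov2, float))
    d = mu1 - mu2
    k = d.size
    quad = d @ np.linalg.solve(cov, d)           # Mahalanobis-type quadratic term
    norm = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

# Example: kernel value between two 2-D Gaussian cluster models
k12 = ppk_gaussian([0.0, 0.0], np.eye(2), [1.0, 1.0], 2.0 * np.eye(2))
k21 = ppk_gaussian([1.0, 1.0], 2.0 * np.eye(2), [0.0, 0.0], np.eye(2))
print(k12, k21)  # equal by symmetry of the kernel
```

Because the kernel depends only on the component means and covariances, SCM can operate on cluster summaries rather than individual points, which is what enables both the speed-up and the privacy-preserving use.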
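The decision tree in the Classification Example can be replayed directly in code; this sketch just encodes the two splits read off the slide (`repeat_visitor` is an illustrative name):

```python
def repeat_visitor(age, salary_k):
    """Decision tree from the Classification Example slide.
    salary_k is the salary in thousands (K)."""
    if age < 25:
        return "Repeat"
    return "Repeat" if salary_k < 50 else "Single"

# Replay the slide's six training records through the tree
records = [(23, 50, "Repeat"), (17, 30, "Repeat"), (43, 40, "Repeat"),
           (68, 50, "Single"), (32, 70, "Single"), (20, 20, "Repeat")]
print(all(repeat_visitor(a, s) == label for a, s, label in records))  # True
```

All six training records are classified correctly, confirming the reconstruction of the table and tree from the slide.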
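The "combine GMMs with modified priors" step of the Adult procedure can be sketched as follows, under the assumption that each party's component priors are rescaled by that party's share of the total sample count, so the pooled mixture still integrates to one. `combine_gmms` and its argument layout are illustrative, not from the paper:

```python
import numpy as np

def combine_gmms(party_gmms, party_counts):
    """Merge per-party GMMs into a single GMM with modified priors.

    party_gmms:   list of (priors, means, covs) triples, one per party
    party_counts: number of samples each party contributed
    Assumed reweighting: new_prior = old_prior * n_party / n_total.
    """
    total = float(sum(party_counts))
    priors, means, covs = [], [], []
    for (w, m, c), n in zip(party_gmms, party_counts):
        priors.extend(np.asarray(w, float) * (n / total))  # rescale by party share
        means.extend(m)
        covs.extend(c)
    return np.array(priors), np.array(means), np.array(covs)

# Three parties, each contributing a 2-component 1-D GMM
gmm = ([0.4, 0.6], [[0.0], [1.0]], [[[1.0]], [[1.0]]])
w, mu, cov = combine_gmms([gmm, gmm, gmm], [100, 200, 300])
print(w.sum())  # combined priors still sum to 1
```

Only component-level statistics (priors, means, covariances) cross party boundaries, which is how individual records stay hidden in the privacy-preserving setting.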