Support Cluster Machine
Paper from ICML 2007
Presented by Haiqin Yang
2007-10-18
The paper "Support Cluster Machine" was written by Bin Li, Mingmin Chi,
Jianping Fan, and Xiangyang Xue, and published at ICML 2007.
Outline
 Background and Motivation
 Support Cluster Machine – SCM
 Kernel in SCM
 Experiments
 An Interesting Application: Privacy-preserving Data Mining
 Discussions
Background and Motivation
 Large-scale classification problems
 Decomposition methods
 Osuna et al., 1997
 Joachims, 1999
 Platt, 1999
 Collobert & Bengio, 2001
 Keerthi et al., 2001
 Incremental algorithms
 Cauwenberghs & Poggio, 2000
 Fung & Mangasarian, 2002
 Laskov et al., 2006
 Parallel techniques
 Collobert et al., 2001
 Graf et al., 2004
 Approximate formulations
 Fung & Mangasarian, 2001
 Lee & Mangasarian, 2001
 Choosing representatives
 Active learning – Schohn & Cohn, 2003
 Cluster-Based SVM – Yu et al., 2003
 Core Vector Machine (CVM) – Tsang et al., 2005
 Clustering SVM – Boley & Cao, 2004
Support Cluster Machine – SCM
 Given training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, $y_i \in \{-1, +1\}$
 Procedure
 Cluster each class of the training data into a mixture of Gaussians, so that each Gaussian component (cluster) becomes a training unit
 Train an SVM-style classifier over the clusters and use its decision function to classify test vectors (a sketch of the objective follows)
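The slide's equations did not survive extraction. As a hedged reconstruction, assuming the standard soft-margin SVM form with each Gaussian cluster density $p_i$ taking the place of a sample (the per-cluster weight $s_i$, meant to reflect cluster size, is my assumption):

\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} s_i \xi_i
\quad \text{s.t.} \quad y_i \big( \langle \mathbf{w}, \phi(p_i) \rangle + b \big) \ge 1 - \xi_i, \;\; \xi_i \ge 0,

where $N$ is the total number of clusters from both classes and $\phi$ is the feature map induced by the kernel on distributions defined on the following slides.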
SCM Solution
 Dual representation (see below)
 Decision function (see below)
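The formulas are missing from the transcript; under the formulation sketched above, the dual and the decision function would take the standard SVM form, with the kernel now evaluated between cluster densities (the upper bound $C s_i$ follows from the assumed cluster weights):

\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(p_i, p_j)
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C s_i

f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{N} \alpha_i y_i \, K(p_i, \delta_{\mathbf{x}}) + b \Big)

Here a test vector $\mathbf{x}$ enters as a degenerate density $\delta_{\mathbf{x}}$, consistent with the later remark that training units are generative models while testing units are vectors.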
Kernel
 Probability product kernel
 By the Gaussian assumption, the kernel admits a closed form (see below)
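The kernel definitions are missing from the transcript. The probability product kernel (Jebara et al., 2004) is

K(p, p') = \int_{\mathbb{R}^d} p(\mathbf{x})^{\rho} \, p'(\mathbf{x})^{\rho} \, d\mathbf{x}

and with $\rho = 1$ (the expected likelihood kernel; this choice of $\rho$ is my assumption) and the Gaussian assumption $p = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$, $p' = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$, the integral has the closed form

K(p, p') = (2\pi)^{-d/2} \, |\Sigma_1 + \Sigma_2|^{-1/2} \exp\!\Big( -\tfrac{1}{2} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top} (\Sigma_1 + \Sigma_2)^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \Big)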
Kernel
 Property I
 That is
 Decision function
 Property II (both properties discussed below)
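The exact statements of the two properties did not survive extraction. Two consequences of the Gaussian closed form that fit the surrounding slides (matching them to Property I/II is my guess):

 With a degenerate (point-mass) test density, the kernel reduces to the likelihood of the test point under the cluster, $K(\mathcal{N}(\boldsymbol{\mu}_i, \Sigma_i), \delta_{\mathbf{x}}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \Sigma_i)$, which is what lets the decision function above score individual test vectors.
 For Gaussians sharing an isotropic covariance $\sigma^2 I$, the kernel reduces to a scaled RBF kernel between the means, $K = (4\pi\sigma^2)^{-d/2} \exp\big( -\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2 / (4\sigma^2) \big)$, connecting SCM to the standard Gaussian-kernel SVM mentioned in the discussion.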
Experiments
 Datasets
 Toydata
 MNIST – handwritten digit (‘0’–’9’) classification
 Adult – privacy-preserving dataset
 Classification methods
 libSVM
 SVMTorch
 SVMlight
 CVM (Core Vector Machine)
 SCM
 Clustering algorithms
 Threshold Order Dependent (TOD)
 EM algorithm
 Model selection
 CPU: 3.0 GHz
Toydata
 Samples: 2,500 samples per class, generated from a mixture of Gaussians
 Clustering algorithm: TOD (a sketch of TOD follows)
 Clustering results: 25 positive clusters, 25 negative clusters
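Since TOD is used throughout the experiments, here is a minimal Python sketch of threshold order-dependent (leader-style) clustering. The Euclidean metric, the incremental mean update, and all names are illustrative assumptions, not the paper's exact algorithm:

import numpy as np

def tod_cluster(X, threshold):
    # Threshold Order Dependent clustering (leader-style sketch).
    # Scan samples in their given order: a sample joins the nearest
    # existing cluster if it lies within `threshold` of that cluster's
    # mean; otherwise it starts a new cluster.
    centers, members = [], []
    for x in X:
        if centers:
            dists = [np.linalg.norm(x - c) for c in centers]
            k = int(np.argmin(dists))
            if dists[k] < threshold:
                members[k].append(x)
                centers[k] = np.mean(members[k], axis=0)  # incremental mean
                continue
        centers.append(np.array(x, dtype=float))
        members.append([x])
    return centers, members

def to_gaussian_components(members, n_total):
    # Summarize each cluster as (prior, mean, covariance) for the SCM kernel.
    comps = []
    for pts in members:
        P = np.asarray(pts, dtype=float)
        mu = P.mean(axis=0)
        # fall back to a tiny spherical covariance for singleton clusters
        cov = np.cov(P, rowvar=False) if len(P) > 1 else 1e-6 * np.eye(P.shape[1])
        comps.append((len(P) / n_total, mu, cov))
    return comps

The order dependence is the point of the method: one pass over the data suffices, which is what makes it cheap enough for the large-scale setting.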
MNIST
 Data description
 10 classes: handwritten digits ‘0’–’9’
 Training samples: 60,000 (about 6,000 per class)
 Testing samples: 10,000
 Construct 45 one-versus-one binary classifiers, one per digit pair (C(10,2) = 45)
 Results
 25 clusters for the EM algorithm
MNIST
 Test results for the TOD algorithm
Privacy-preserving Data Mining
 Inter-enterprise data mining
 Problem: two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information
 Horizontally partitioned
 Records (users) split across companies
 Example: credit card fraud detection model
 Vertically partitioned
 Attributes split across companies
 Example: associations across websites
Privacy-preserving Data Mining
 Randomization approach
[Figure: randomization flow. Original records (30 | 70K | ...; 50 | 40K | ...) pass through a randomizer to become perturbed records (65 | 20K | ...; 25 | 60K | ...); the distributions of Age and Salary are reconstructed from the perturbed records and fed to data mining algorithms, which output the model.]
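A minimal Python sketch of the additive-noise randomization idea behind the figure; the attribute values, the noise model, and the moment-based reconstruction are illustrative assumptions (the original approach reconstructs full distributions with an iterative Bayesian procedure):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical private attribute (e.g., Age) held by the users.
age = rng.normal(38.0, 12.0, size=10_000)

# Randomizer: each value is perturbed with noise drawn from a publicly
# known distribution before it leaves the user's machine.
noise_std = 20.0
randomized = age + rng.normal(0.0, noise_std, size=age.shape)

# The miner sees only the randomized values, but because the noise
# distribution is public, aggregate statistics can still be recovered:
est_mean = randomized.mean()               # E[x + r] = E[x] when E[r] = 0
est_var = randomized.var() - noise_std**2  # Var[x] = Var[x + r] - Var[r]
print(f"estimated Age mean {est_mean:.1f}, std {np.sqrt(est_var):.1f}")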
Classification Example
Age | Salary | Repeat Visitor?
----|--------|----------------
23  | 50K    | Repeat
17  | 30K    | Repeat
43  | 40K    | Repeat
68  | 50K    | Single
32  | 70K    | Single
20  | 20K    | Repeat

Decision tree:

Age < 25?
 Yes -> Repeat
 No  -> Salary < 50K?
         Yes -> Repeat
         No  -> Single
Privacy-preserving Dataset: Adult
 Data description
 Training samples: 30,162
 Testing samples: 15,060
 Percentage of positive samples: 24.78%
 Procedure
 Horizontally partition the data into three subsets (parties)
 Cluster each party’s data with the TOD algorithm
 Obtain three positive and three negative GMMs
 Combine the per-party GMMs into one positive and one negative GMM with modified priors (see the sketch below)
 Classify with SCM
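The combination step lends itself to a short Python sketch: the parties exchange only cluster-level summaries, and each component's prior is rescaled by its party's share of the pooled data so the merged mixture still sums to one (all names are illustrative):

def merge_gmms(party_gmms, party_sizes):
    # party_gmms: one GMM per party, each a list of (prior, mean, cov)
    #             tuples whose priors sum to 1 within that party.
    # party_sizes: number of samples each party contributed.
    total = sum(party_sizes)
    merged = []
    for gmm, n in zip(party_gmms, party_sizes):
        for prior, mean, cov in gmm:
            # modified prior: weight by the party's share of the data
            merged.append((prior * n / total, mean, cov))
    return merged  # priors sum to 1 across the merged mixture

Only priors, means, and covariances cross party boundaries, which is how individual records stay hidden while SCM can still be trained on the union of the databases.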
Privacy-preserving Dataset: Adult
 Partition results
 Experimental results
Discussions
 Solved problems
 Large-scale problems: downsample by clustering, then classify
 Privacy-preserving problems: hide individual information
 Differences from other methods
 Training units are generative models; testing units are vectors
 Training units carry complete statistical information
 Only one parameter for model selection
 Easy to implement
 Generalization ability is not yet clear, whereas for the RBF kernel in SVM a larger kernel width leads to a lower VC dimension
Discussions
 Advantages of using priors and covariances
Thank you!