Universal Learning over Related Distributions and Adaptive Graph Transduction
Erheng Zhong†, Wei Fan‡, Jing Peng*, Olivier Verscheure‡, and Jiangtao Ren†
†Sun Yat-Sen University  ‡IBM T. J. Watson Research Center  *Montclair State University

Outline
1. Go beyond transfer learning to sample selection bias and uncertainty mining
2. Unified framework
3. One single solution: supervised case

Standard Supervised Learning
• Training (labeled) and test (unlabeled) data follow the identical distribution.
• Example: a classifier trained on New York Times articles and tested on New York Times articles reaches 85.5% accuracy.

Sample Selection Bias
• Training and test data come from the same domain, but the accuracy drops from 85.5% to 78.5%.
• The two sets have different word-vector distributions: August articles report heavily on a typhoon in Taiwan, while September articles report on the US Open.

Uncertainty Data Mining
• Training data: both feature vectors and class labels contain noise (usually Gaussian), which is common for data collected from sensor networks.
• Test data: feature vectors contain noise.

Summary
• Traditional supervised learning: training and test data follow the identical distribution.
• Transfer learning: training and test data come from different domains.
• Sample selection bias: same domain, but the distributions differ (e.g., labels missing not at random).
• Uncertain data mining: the data contain noise.
• In other words, in all three cases training and test data are drawn from different distributions.
• Traditionally, each problem is handled separately.

Main Challenge
• Domain transfer, sample selection bias, and noisy training data all cause a performance drop: the data come from different but related distributions.
• Could one solve these different but similar problems under a uniform framework? With the same solution?
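The sample-selection-bias setting can be made concrete with a minimal sketch (the toy data and the feature-sorting scheme are illustrative assumptions, echoing the bias-generation procedure used later in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points, 2 features, drawn from one domain.
X = rng.normal(size=(200, 2))

# Biased sample: sort by the first feature and keep the top 50% as
# "training", mimicking the feature-sorting bias generation scheme.
order = np.argsort(-X[:, 0])
train_idx, test_idx = order[:100], order[100:]

# The training and test marginals now differ even though both come
# from the same underlying domain (sample selection bias).
shift = float(X[train_idx, 0].mean() - X[test_idx, 0].mean())
print(round(shift, 2))  # training mean of feature 0 is clearly higher
```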
Universal Learning
• Let B be the set of subsets of X that are the support of some hypothesis in a fixed hypothesis space ([Blitzer et al, 2008]).
• The distance between two distributions ([Blitzer et al, 2008]) is d = 2 * max over b in B of |ptr(b) - pte(b)|.
• Example: for subsets b and c, Db = |ptr(b) - pte(b)|, Dc = |ptr(c) - pte(c)|, and d = 2 * max(Db, Dc).
• (Figure: hypotheses h1, h2, h3 on points a, b, c, d, with h1(a)=0, h2(a)=1, h3(a)=1; h1(b)=h2(b)=h3(b)=1; h1(c)=h2(c)=h3(c)=1; h1(d)=0, h2(d)=1, h3(d)=0; the training distribution concentrates on b and the test distribution on c.)

How to Handle Universal Learning?
• Most traditional classifiers cannot guarantee performance when the training and test distributions are different.
• Could we find one classifier under a weaker assumption? Graph transduction?

Advantage of Graph Transduction
• Weaker assumption: the decision boundary lies in the low-density regions of the unlabeled data (two-Gaussians vs. two-arcs example).

Just Graph Transduction?
• "Un-smooth label" (more examples in the low-density region) and "class imbalance" problems ([Wang et al, 2008]) may mislead the decision boundary into high-density regions.
• Example: the bottom part is closest to a red square, and there are more red squares than blue squares.

Sample Selection: which samples?

Maximum Margin Graph Transduction
• In margin terms, unlabeled data with low margin are likely misclassified!
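The low-margin claim can be checked with a minimal sketch (the fixed linear classifier and the two toy Gaussians are assumptions for illustration, not the slide's setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: two Gaussian classes; |f(x)| plays the role of
# the unlabeled-data margin of a score function f(x) = w . x.
X_pos = rng.normal(loc=+1.5, size=(50, 2))
X_neg = rng.normal(loc=-1.5, size=(50, 2))
X_unlab = np.vstack([X_pos, X_neg])
y_true = np.array([1] * 50 + [-1] * 50)

w = np.array([1.0, 1.0])          # a fixed linear classifier
scores = X_unlab @ w
margins = np.abs(scores)          # margin of each unlabeled point

# Points with low margin sit near the decision boundary; they are the
# ones most likely to be misclassified, as the slide argues.
low = margins < np.median(margins)
err_low = float(np.mean(np.sign(scores[low]) != y_true[low]))
err_high = float(np.mean(np.sign(scores[~low]) != y_true[~low]))
print(err_low >= err_high)
```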
(Figure: decision boundaries on a toy data set. With a bad labeled sample the boundary cuts through a dense region; with a good labeled sample it stays in the low-density gap.)

Main Flow
• Labeled source-domain data and unlabeled target-domain data feed into maximum-margin sample selection and graph transduction.
• Predict the labels of the unlabeled data; select samples so as to maximize the unlabeled-data margin.
• Update the graph and run label propagation.
• An average ensemble further lifts the unlabeled-data margin.

Properties
• The training error of Adaptive Graph Transduction can be bounded in terms of the empirical error of the ideal hypothesis and the distance between the training and the test distribution.
• If one classifier has a larger unlabeled-data margin, its training error is smaller (recall the last theorem).
• The average ensemble is likely to achieve a larger margin.

Experiment – Data Set
• Transfer learning
– Reuters-21578: Reuters news articles organized in a category hierarchy (e.g., org vs. place, with subcategories org.subA/org.subB and place.subA/place.subB split between source and target domains). First fill up the "GAP", then use a kNN classifier to do classification.
– SyskillWebert: HTML source of web pages plus the ratings of a user on those pages (Sheep, Goats, Bands, Biomedical recording split between source and target domains). First fill up the "GAP", then use a kNN classifier to do classification.
• Sample selection bias correction
– UCI data sets: Ionosphere, Diabetes, Haberman, WDBC.
– Bias generation: 1. randomly select 50% of the features and sort the data set according to each selected feature from high to low; 2. take the top instances of every sorted list as the training set.
• Uncertainty mining
– Kent Ridge Biomedical Repository: high-dimensional, low sample size (HDLSS) data sets.
– Generate two different Gaussian noises and add them to the training and test sets.

Experiment – Baseline Methods
• Graph: the original graph transduction algorithm ([Zhu, 2005]), using the entire training data set.
– RandGraph: a variation choosing a randomly selected sample whose size equals the one chosen by MarginGraph.
• CDSC: a transfer learning approach ([Ling et al, 2008]) that finds a mapping space optimizing a consistency measure between the out-of-domain supervision and the in-domain intrinsic structure.
• BRSD-BK/BRSD-DB: bias correction approaches ([Ren et al, 2008]) that discover structure and re-balance using unlabeled data.

Performance – Transfer Learning
• (Charts: accuracy and AUC of Graph, RandGraph, CDSC, and MarginGraph on six data sets: O vs Pe, O vs Pl, Pe vs Pl, Sheep, Goats, Biomedical.)
• MarginGraph performs best on 5 of the 6 data sets in both accuracy and AUC.

Performance – Sample Selection Bias
• (Charts: accuracy and AUC of Graph, RandGraph, BRSD-BK, BRSD-DB, and MarginGraph on Ionosphere, Diabetes, Haberman, and WDBC.)
• Accuracy: best on all 4 data sets! AUC: best on 2 of 4 data sets.

Performance – Uncertainty Mining
• (Charts: accuracy and AUC of Graph, RandGraph, and MarginGraph on Leukemia, ColonTumor, ProCancer, and CNS.)
• Accuracy: best on all 4 data sets! AUC: best on all 4 data sets!

Margin Analysis
• MarginBase is the base classifier of MarginGraph in each iteration.
• LowBase is a "minimal margin classifier" which selects samples for building a classifier with minimal unlabeled-data margin.
• LowGraph is the averaging ensemble of LowBase.
• Results: maximal margin is better than minimal margin, and the ensemble is better than any single classifier.

Conclusion
• Covers different formulations where the training and test sets are drawn from related but different distributions.
• Flow
– Step-1 Sample selection: select labeled data from the different distribution that maximize the unlabeled-data margin
– Step-2 Label propagation: label the unlabeled data
– Step-3 Ensemble: further lift the unlabeled-data margin
• Code and data available from http://www.cs.columbia.edu/~wfan
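The three steps can be combined into a minimal sketch. All names, the toy Gaussians, the random candidate subsets, and the Gaussian-kernel graph are illustrative assumptions, not the authors' exact MarginGraph implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def label_propagation(X, labeled_idx, y_labeled, sigma=1.0, n_iter=50):
    """Step-2: propagate +1/-1 labels over a Gaussian-kernel graph."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    f = np.zeros(len(X))
    f[labeled_idx] = y_labeled
    for _ in range(n_iter):
        f = P @ f
        f[labeled_idx] = y_labeled         # clamp the labeled points
    return f

# Toy labeled source-domain and unlabeled target-domain data.
Xs = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
ys = np.array([1] * 20 + [-1] * 20)
Xt = np.vstack([rng.normal(+2, 1, (30, 2)), rng.normal(-2, 1, (30, 2))])

X_all = np.vstack([Xs, Xt])
unlab = np.arange(len(Xs), len(X_all))

# Step-1: among candidate labeled subsets, keep the one whose propagated
# labels maximize the mean unlabeled-data margin; repeat for a few rounds.
preds = []
for _ in range(3):                         # ensemble rounds
    best_f, best_margin = None, -np.inf
    for _ in range(5):                     # candidate labeled subsets
        cand = rng.choice(len(Xs), size=10, replace=False)
        f = label_propagation(X_all, cand, ys[cand])
        margin = np.abs(f[unlab]).mean()
        if margin > best_margin:
            best_margin, best_f = margin, f
    preds.append(best_f[unlab])

# Step-3: averaging ensemble over the per-round predictions.
f_ens = np.mean(preds, axis=0)
y_hat = np.sign(f_ens)
acc = float(np.mean(y_hat == np.array([1] * 30 + [-1] * 30)))
print(acc)
```

On this well-separated toy problem the maximum-margin selection recovers the cluster structure of the target data; the point of the sketch is only the shape of the loop, not the numbers.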