* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Document
Generalized linear model wikipedia , lookup
Computational complexity theory wikipedia , lookup
Artificial intelligence wikipedia , lookup
Natural computing wikipedia , lookup
Theoretical computer science wikipedia , lookup
Genetic algorithm wikipedia , lookup
Drift plus penalty wikipedia , lookup
K-nearest neighbors algorithm wikipedia , lookup
Shapley–Folkman lemma wikipedia , lookup
Pattern recognition wikipedia , lookup
Reinforcement learning wikipedia , lookup
Machine learning wikipedia , lookup
Large-Scale Text Categorization By Batch Mode Active Learning Steven C.H. Hoi†, Rong Jin‡, Michael R. Lyu† † CSE Department, Chinese University of Hong Kong ‡ CSE Department, Michigan State University 26-May, 2006 To appear in International World Wide Web conference, Edinburgh, Scotland, 22-26 May, 2006. Outline Introduction Related Work Batch Mode Active Learning Theoretical Foundation Convex Optimization Formulation Eigen Space Simplification Bound Optimization Algorithm Experimental Results Conclusion and Future Work 2 Introduction Text Categorization Problem Assign documents to predefined topics Significances Core Web data mining technique Applications: category browsing, vertical search, etc. Challenges To build efficient classifiers To minimize human labeling effort 3 Introduction Logistic Regression Efficiency for Training and Prediction Natural Probability Output State-of-the-art performance, etc… Linear model where is the class label. Simplified notation: 4 Introduction Active Learning To find most informative unlabeled examples Traditional Methodology Choose one unlabeled example for labeling Retrain the classifier with the additional example Limitation Only one example in each iteration, huge retraining cost Our solution: Batch Mode Active Learning To find a batch of most informative unlabeled examples 5 Outline Introduction Related Work Batch Mode Active Learning Theoretical Foundation Convex Optimization Formulation Eigen Space Simplification Bound Optimization Algorithm Experimental Results Conclusion and Future Work 6 Related Work Statistical Models for Classification K Nearest Neighbors (Masand et al., SIGIR’92), Decision Trees (Apte et al., TOIS’94), Bayesian Classifiers (Tzeras et al., SIGIR’93), Inductive Rule Learning (Cohen et al., ICML’95), etc. Neural Networks (Ruiz et al., IR’02), Support Vector Machines (SVM) (Joachims, ECML’98, Tong et al., ICML’00), and Logistic Regressions (Zhang et al., ICML’00), etc. 7 Related Work Active Learning Query-By-Committee (Liere et al AAAI’97), EM & Active Learning (Nigam et al’98), etc. Margin Based Methods: Support Vector Machine Active Learning (Tong et al., ICML’00) Measure uncertainty by the distances from decision boundaries 8 Batch Mode Active Learning Toy Example – Positive examples of class-1 – Negative examples of class-2 – Unlabeled examples – Selected examples for labeling D1 D2 (a) Binary classification example (b) Margin-based active learning (c) Batch mode active learning 9 Outline Introduction Related Work Batch Mode Active Learning Theoretical Foundation Convex Optimization Formulation Eigen Space Simplification Bound Optimization Algorithm Experimental Results Conclusion and Future Work 10 Theoretical Foundation Main Idea: Based on the theoretical framework of maximization of Fisher information Problem Setting In a probabilistic classification framework, assume the classification model is a semi-parametric form For example, the logistic regression model: 11 Theoretical Foundation The problem of batch mode active learning can be regarded as a problem to seek a resample distribution q(x) of the unlabeled data. The examples with large resampling probabilities will be selected as the most informative ones for labeling. According to statistical estimation theory, active learning should consider a resample distribution q(x) that maximizes the following Fisher information 12 Theoretical Foundation The maximization of Fisher information is equivalent to find the resample distribution q(x) that minimizes the ratio of two Fisher information matrixes: For the logistic regression model, the Fisher information matrix can be expressed as: We replace the integration in the above equation with the summation over the unlabeled data: 13 Convex Optimization Formulation Rewrite the objective function as Introduce a slack matrix ,then turn the original problem into the following optimization: In the above, we use 14 Convex Optimization Formulation By the Schur complementary theorem, i.e., we turn it into the following optimization : 15 Convex Optimization Formulation The final optimization problem can be expressed The above problem belongs to the family of Semidefinite programming (SDP) and can be solved by convex optimization techniques. 16 Eigen Space Simplification Directly solving the above optimization problem is computationally expensive for the large-size slack matrix variable of M. In order to reduce the computational complexity, we propose an Eigen space simplification method to make the solution simpler and more effective. We assume that M is expanded in the Eigen space of the Fisher information matrix Ip. 17 Eigen Space Simplification Let be the top s eigen vectors of the Fisher information matrix Ip, where λ1 ≥ λ2 ≥ . . . ≥ λs, then we assume the matrix M has the following form: We rewrite the inequality 18 Eigen Space Simplification Using the eigen expression, we have Given the necessary condition for is Therefore, we have the following result 19 Eigen Space Simplification The above necessary condition leads to following constraints: Meanwhile, the objective function of tr(M) can be expressed as 20 Eigen Space Simplification By putting the above two expressions together, we transform the SDP problem into the following approximate optimization problem: Note that the above optimization problem belongs to convex optimization since f(x) = 1/x is convex when x ≥ 0. 21 Bound Optimization Algorithm Lemma 1: Let L(q) be the objective function, we have the following conclusion: 22 Bound Optimization Algorithm Given the lemma 1, now instead of optimizing the original objective function L(q), we can optimize its upper bound using simple updating equations:, This algorithm will guarantee to converge to a local optimal. Since the original problem is a convex optimization problem, the above updating procedure will guarantee to converge to a global optimal. 23 Bound Optimization Algorithm The updating step: Some Observations (i) The example with a large classification uncertainty will be assigned with a large probability. (ii) The example that is similar to many unlabeled examples is more likely to be selected. 24 Outline Introduction Related Work Batch Mode Active Learning Theoretical Foundation Convex Optimization Formulation Eigen Space Simplification Bound Optimization Algorithm Experimental Results Conclusion and Future Work 25 Experimental Testbeds 3 standard text datasets Reuters-21578 dataset (10788) Two web-related datasets: WebKB (4518) and Newsgroup (10966) 26 Experimental Settings A standard feature selection by Information Gain is conducted to remove uninformative features, in which 500 of the most informative features are selected. The F1 metric is adopted as our evaluation metric, which has been shown to be more reliable metric than other metrics such as the classification accuracy. More specifically, the F1 is defined as where p and r are precision and recall. Parameters of LogReg and SVM are determined by a standard cross validation method. 27 Comparison Schemes Two popular active learning methods: SVM-AL: the classification uncertainty of an example x is determined by its distance to the decision boundary The smaller the distance d(x;w, b) is, the more the classification uncertainty will be. LogReg-AL: the logistic regression active learning algorithm that measures the classification uncertainty based on the entropy of the distribution p(y|x). The larger the entropy of x is, the more uncertain we are about the class labels of x. Our Batch Mode Active Learning algorithm with logistic regression, i.e., LogReg-BMAL in short. 28 Empirical Evaluation Experimental Results with Reuters-21578 average results over 40 executions 100 training examples and 100 active examples 29 Empirical Evaluation Experimental Results with Reuters-21578 30 Empirical Evaluation Experimental Results with Reuters-21578 31 Empirical Evaluation Experimental Results with Web-KB Dataset 32 Empirical Evaluation Experimental Results with Newsgroup Dataset 33 Conclusion A batch mode active learning scheme is proposed to attack the challenge of large-scale text categorization. The main contributions include A new active learning scheme is suggested for largescale text categorization to overcome the limitation of traditional active learning; A batch mode active learning solution is formulated by convex optimization techniques; An effective bound optimization algorithm is proposed to solve the batch mode active learning problem Extensive experiments are conducted for empirical evaluations in comparisons with state-of-the-art active learning approaches for text categorization 34 Future Work To combine batch mode active learning with semisupervised learning To improve the computational costs To study the convergence of the bound optimization To extend the methodology for other classification models 35 Questions? Thank you for your attention! Http://www.cse.cuhk.edu.hk/~chhoi/ 36 Appendix A – Statistical Estimation Theory Given a semi-parametric model, say the logistic model as: In theory, one can use the maximum-likelihood estimate (MLE) to determine the model parameter as: In theory, the MLE achieves the Cramer-Rao lower bound, thus, the MLE is the asymptotically most efficient estimator, whose efficiency can be measured by the Fisher information that is intrinsic to the probability model. 37 Appendix A – Statistical Estimation Theory (cont.) More specifically, the expected log-likelihood to measure the goodness of q(x) can be given as Hence, according to the Crammer-Rao lower bound, the MLE based on the resample distribution q(x) that minimizes is the most efficient estimator of alpha among all estimators based on a resampling of x. Therefore, the result of q to solve the optimization is the optimal sample distribution for active learning. 38 Appendix B. Fisher Information and Cramer-Rao lower bound Fisher information is thought of as the amount of information that an observable random variable X carries about an unobservable parameter θ upon which the probability distribution of X depends. Since the expectation of the score is zero, the variance is also the second moment of the score and so the Fisher information can be written In statistics, the Cramér-Rao lower bounds express a lower bound on the accuracy of a statistical estimator, based on Fisher information. It states that the reciprocal of the Fisher information, , of a parameter θ, is a lower bound on the variance of an unbiased estimator of the parameter (denoted ). 39 Appendix C – Convexity Theorem Theorem. Any locally optimal point of a convex problem is (globally) optimal. 40 41 Appendix D – Semidefinite Programming (SDP) 42 Appendix E: Proof of Lemma1 Lemma 1: Let L(q) be the objective function in (15), we have the following conclusion Proof. 43 Proof (cont.): Using the convexity property of reciprocal function, namely for x ≥ 0 and p.d.f. . We can arrive the following deduction 44 Proof (cont.): Substituting the above inequation back to L(q), we can attain the following inequality: This finishes the proof of the inequality lemma. □ 45