Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories
Rayid Ghani
Center for Automated Learning & Discovery, Carnegie Mellon University

Some Recent Work
• Learning from sequences of fMRI brain images (with Tom Mitchell)
• Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic)
• Effect of smoothing on Naive Bayes for text classification (with Tong Zhang @ IBM Research)
• Hypertext categorization using links and extracted information (with Sean Slattery & Yiming Yang)
• Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam)
• Error-Correcting Output Codes for text classification

Text Categorization
Domains: •Topics •Genres •Languages
Numerous ($$$-making) applications: •Search Engines/Portals •Customer Service •Email Routing …

Problems
• Practical applications such as web portals deal with a large number of categories
• A lot of labeled examples are needed for training the system

How do people deal with a large number of classes?
• Use fast multiclass algorithms (Naïve Bayes), which build one model per class
• Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems
• What happens with a 1000-class problem? Can we do better?

ECOC to the Rescue!
• An n-class problem can be solved by solving log2(n) binary problems
• More efficient than one-per-class
• Does it actually perform better?

What is ECOC?
• Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
• Use a learner to learn the binary problems

Training and Testing with ECOC
Each class is assigned a binary codeword; each column fi defines one binary problem:

       f1 f2 f3 f4
  A     0  1  0  0
  B     0  0  1  1
  C     1  1  1  0
  D     1  0  1  0

A test example X is classified by each fi (here X → 1 1 1 1) and assigned to the class whose codeword is closest in Hamming distance.

ECOC — Picture
[Four slide builds over the same codeword matrix: training the binary functions f1–f4 column by column, then decoding the test example X = 1 1 1 1 against the codewords for A–D]

ECOC works but…
• Increased code length = increased accuracy
• Increased code length = increased computational cost
[Plot: efficiency vs. classification performance, locating Naïve Bayes, ECOC (as used in Berger 99), and the GOAL region]

Choosing the codewords
• Random? [Berger 1999, James 1999] — asymptotically good (the longer the better), but the computational cost is very high
• Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995] — guaranteed properties for a fixed-length code

Experimental Setup
• Generate the code: BCH codes
• Choose a base learner: the Naive Bayes classifier as used in text classification tasks (McCallum & Nigam 1998)

Text Classification with Naïve Bayes
• "Bag of words" document representation
• Estimate parameters of the generative model:

  P(word | class) = (1 + Σ_{doc ∈ class} N(word, doc)) / (|V| + Σ_{doc ∈ class} N(doc))

  where N(word, doc) is the number of occurrences of word in doc, N(doc) is the total number of words in doc, and |V| is the vocabulary size.

• Naïve Bayes classification:

  P(class | doc) = [P(class) / P(doc)] · Π_{word ∈ doc} P(word | class)

Industry Sector Dataset [McCallum et al. 1998, Ghani 2000]
• Consists of company web pages classified into 105 economic sectors

Results — Industry Sector Data Set

  Naïve Bayes   Shrinkage¹   MaxEnt²   MaxEnt w/ Prior³   ECOC 63-bit
  66.1%         76%          79%       81.1%              88.5%

ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.
1. (McCallum et al. 1998)   2, 3. (Nigam et al. 1999)
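The Naïve Bayes model described above can be written out directly. The following is a minimal sketch, not the implementation used in the talk: the tokenization (documents as token lists) and the toy corpus are illustrative assumptions, and P(doc) is dropped since it is constant across classes.

import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate P(class) and Laplace-smoothed P(word | class) from (doc, label) pairs."""
    vocab = set()
    word_counts = defaultdict(Counter)   # class -> word -> count
    total_words = Counter()              # class -> total number of words
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for word in doc:
            vocab.add(word)
            word_counts[label][word] += 1
            total_words[label] += 1
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    def p_word_given_class(word, c):
        # (1 + sum_{doc in class} N(word, doc)) / (|V| + sum_{doc in class} N(doc))
        return (1 + word_counts[c][word]) / (len(vocab) + total_words[c])
    return priors, p_word_given_class

def classify_nb(doc, priors, p_word_given_class):
    """argmax_class P(class) * prod_{word in doc} P(word | class), in log space."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(math.log(p_word_given_class(w, c)) for w in doc)
    return max(priors, key=log_posterior)

# Illustrative toy corpus (bag-of-words documents as token lists)
docs = [["stocks", "bank", "profit"], ["patient", "drug", "trial"], ["bank", "loan", "rates"]]
labels = ["finance", "pharma", "finance"]
priors, p_w_c = train_nb(docs, labels)
print(classify_nb(["drug", "patient"], priors, p_w_c))   # -> "pharma"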
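The ECOC pipeline itself (codeword matrix, one binary learner per column, minimum-Hamming-distance decoding) can be sketched as follows. This is an illustrative outline rather than the talk's implementation: it draws a random code instead of a BCH code, and it reuses the Naïve Bayes sketch above as the base learner.

import random

def make_code(classes, n_bits, seed=0):
    """Assign each class a random binary codeword of length n_bits (the talk uses BCH codes instead)."""
    rng = random.Random(seed)
    return {c: [rng.randint(0, 1) for _ in range(n_bits)] for c in classes}

def train_ecoc(docs, labels, code):
    """Train one binary Naïve Bayes classifier per codeword column."""
    n_bits = len(next(iter(code.values())))
    binary_models = []
    for bit in range(n_bits):
        # Relabel every document with the bit its class's codeword has in this column
        bit_labels = [code[y][bit] for y in labels]
        binary_models.append(train_nb(docs, bit_labels))
    return binary_models

def classify_ecoc(doc, binary_models, code):
    """Predict each bit, then return the class whose codeword is closest in Hamming distance."""
    predicted = [classify_nb(doc, priors, p_w_c) for priors, p_w_c in binary_models]
    def hamming(c):
        return sum(p != b for p, b in zip(predicted, code[c]))
    return min(code, key=hamming)

# Usage with the toy corpus above
code = make_code({"finance", "pharma"}, n_bits=15)
models = train_ecoc(docs, labels, code)
print(classify_ecoc(["bank", "rates"], models, code))

The minimum Hamming distance computed during decoding is the same quantity the "Min HD" histograms later in the talk are built from.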
Min HD for correctly / wrongly classified examples
[Histograms: frequency vs. minimum Hamming distance (Min HD), shown separately for correctly and wrongly classified examples]

ECOC for better Precision
[Precision–recall curves comparing 15-bit ECOC with Naïve Bayes]

New Goal
[Plot: efficiency vs. classification performance, locating NB, ECOC (as used in Berger 99), and the GOAL region]

Solutions
• Design codewords that minimize cost and maximize "performance"
• Investigate the assignment of codewords to classes
• Learn the decoding function
• Incorporate unlabeled data into ECOC

What happens with sparse data?
[Plot: percent decrease in error vs. training size, for 15-bit, 31-bit, and 63-bit codes]

Use unlabeled data with a large number of classes — how?
• Use EM → mixed results → think again!
• Use Co-Training → disastrous results → think one more time

How to use unlabeled data?
• Current learning algorithms that use unlabeled data (EM, Co-Training) don't work well with a large number of categories
• ECOC works great with a large number of classes, but there is no framework for using unlabeled data

ECOC + Co-Training = ECoTrain
• ECOC decomposes multiclass problems into binary problems
• Co-Training works great with binary problems
• ECOC + Co-Train = learn each binary problem in ECOC with Co-Training

ECOC + CoTrain — Results (105-class problem; accuracy in %)

  Algorithm                        Uses unlabeled data?   300L + 0U   50L + 250U per class   5L + 295U per class
  Naïve Bayes                      No                     76          67                     40.3
  ECOC 15-bit                      No                     76.5        68.5                   49.2
  EM                               Yes                    —           68.2                   51.4
  Co-Train                         Yes                    —           67.6                   50.1
  ECoTrain (ECOC + Co-Training)    Yes                    —           72.0                   56.1

  (L = labeled examples, U = unlabeled examples)

What Next?
• Use an improved version of Co-Training (gradient descent): less prone to random fluctuations; uses all unlabeled data at every iteration
• Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training

Potential Drawbacks
• Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems

Summary
• Use ECOC for efficient text classification with a large number of categories
• Increase accuracy & efficiency
• Use unlabeled data by combining ECOC and Co-Training
• Generalize to domain-independent classification tasks involving a large number of categories
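To make the ECoTrain combination above concrete, one round of Co-Training for a single ECOC bit might look roughly like the sketch below. It is a simplified illustration under stated assumptions, building on the Naïve Bayes and ECOC sketches earlier: each document is assumed to come as two feature views (e.g., two halves of the text), "confidence" is taken as the Naïve Bayes posterior gap, and the growth rate per round is arbitrary; none of these details are specified in the talk.

import math

def cotrain_bit(labeled, unlabeled, rounds=5, grow=2):
    """Co-training for one binary ECOC problem.
    labeled:   list of ((view1_tokens, view2_tokens), bit) pairs
    unlabeled: list of (view1_tokens, view2_tokens) pairs
    Returns the two per-view Naïve Bayes models trained on the augmented labeled set."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        models = []
        for view in (0, 1):
            docs = [views[view] for views, bit in labeled]
            bits = [bit for views, bit in labeled]
            models.append(train_nb(docs, bits))   # reuses the Naïve Bayes sketch above
        if not unlabeled:
            break
        # Each view labels the unlabeled examples it is most confident about
        for view, (priors, p_w_c) in enumerate(models):
            def confidence(views):
                scores = sorted(
                    (math.log(priors[c]) + sum(math.log(p_w_c(w, c)) for w in views[view]) for c in priors),
                    reverse=True)
                return scores[0] - (scores[1] if len(scores) > 1 else scores[0])
            unlabeled.sort(key=confidence, reverse=True)
            for views in unlabeled[:grow]:
                labeled.append((views, classify_nb(views[view], priors, p_w_c)))
            unlabeled = unlabeled[grow:]
    return models

One such pair of per-view models would be trained for every column of the ECOC matrix; at test time the two views' predictions for each bit are combined (e.g., by multiplying their posteriors) and the resulting bit vector is decoded to the nearest codeword, as in the plain ECOC sketch.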