Open Source Text Mining
Hinrich Schütze, Enkata
Text Mining 2003 @ SDM03, Cathedral Hill Hotel, San Francisco, May 3, 2003

Slide 2: Motivation
- Open source used to be a crackpot idea.
  - Bill Gates on Linux (1999.03.24): "I really don't think in the commercial market, we'll see it in any significant way."
  - Microsoft 10-Q quarterly filing (2003.01.31): "The popularization of the open source movement continues to pose a significant challenge to the company's business model."
- Open source is an enabler for radical new things:
  - Google
  - Ultra-cheap web servers
  - Free news, free email, free …
  - Class projects
  - Walmart PC for $200

Slide 3: GNU-Linux
[Figure]

Slide 4: Web Servers: Open Source Dominates
[Chart. Source: Netcraft]

Slide 5: Motivation (cont.)
- Text mining has not had much impact.
  - Many small companies & small projects
  - No large-scale adoption
  - Exception: text-mining-enhanced search
- Text mining could transform the world.
  - Unstructured → structured
  - Information explosion: the amount of information has exploded; the amount of accessible information has not.
- Can open source text mining make this happen?

Slide 6: Unstructured vs. Structured Data
[Bar chart: unstructured vs. structured, as a percentage of data volume and of market cap. Source: Prabhakar Raghavan, Verity]

Slide 7: Business Motivation
- High cost of deploying text mining solutions. How can we lower this cost?
- 100% proprietary solutions:
  - require re-invention of core infrastructure
  - leave fewer resources for high-value applications built on top of the core infrastructure

Slide 8: Definitions
- Open source: public domain, BSD, GPL (GNU General Public License)
- Text mining:
  - like data mining, but for text
  - a subdiscipline of NLP (Natural Language Processing)
  - has interesting applications now
  - more than just information retrieval / keyword search
  - usually has some statistical, probabilistic, or frequentist component

Slide 9: Text Mining vs. NLP (Natural Language Processing)
- What is not text mining: speech, language models, parsing, machine translation
- Typical text mining: clustering, information extraction, question answering
- Statistical and high volume

Slide 10: Text Mining: History
- 1980s: Electronic text gives birth to Statistical Natural Language Processing (StatNLP).
- 1990s: DARPA sponsors the Message Understanding Conferences (MUC) and the Information Extraction (IE) community.
- Mid-1990s: Data mining becomes a discipline and usurps much of IE and StatNLP as "text mining".

Slide 11: Text Mining: Hearst's Definition
- Finding nuggets, finding patterns:
  - Information extraction
  - Question answering
  - Clustering
  - Knowledge discovery
  - Text visualization

Slide 12: Information Extraction
Example of an extracted job record:
- foodscience.com-Job2
  - JobTitle: Ice Cream Guru
  - Employer: foodscience.com
  - JobCategory: Travel/Hospitality
  - JobFunction: Food Services
  - JobLocation: Upper Midwest
  - Contact Phone: 800-488-2611
  - DateExtracted: January 8, 2001
  - Source: www.foodscience.com/jobs_midwest.htm
  - OtherCompanyJobs: foodscience.com-Job1

Slide 13: Knowledge Discovery: Arrowsmith
- Goal: connect two disconnected subfields of medicine.
- Technique (see the sketch after slide 14):
  - Start with the first subfield
  - Identify key concepts
  - Search for a second subfield with the same concepts
- Implemented in the Arrowsmith system
- Discovery: magnesium is a potential treatment for migraine

Slide 14: Knowledge Discovery: Arrowsmith
[Figure]
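The three-step technique on slide 13 can be illustrated with a toy sketch that looks for concepts shared by the literatures of two subfields. It assumes each subfield's abstracts are available as a list of strings; the token pattern and the minimum-document-frequency scoring are illustrative stand-ins, not Arrowsmith's actual linking-term statistics.

```python
# Toy sketch of Arrowsmith-style discovery: find terms ("concepts") that
# occur in the literatures of two otherwise disconnected subfields.
import re
from collections import Counter


def doc_frequencies(docs):
    """In how many documents does each lowercased token of 4+ letters occur?"""
    counts = Counter()
    for doc in docs:
        counts.update(set(re.findall(r"[a-z]{4,}", doc.lower())))
    return counts


def shared_concepts(subfield_a, subfield_b, top_n=20):
    """Rank terms that appear in both literatures; a term must be reasonably
    frequent in *both* to score highly (score = smaller document frequency)."""
    df_a, df_b = doc_frequencies(subfield_a), doc_frequencies(subfield_b)
    shared = set(df_a) & set(df_b)
    return sorted(shared, key=lambda t: min(df_a[t], df_b[t]), reverse=True)[:top_n]


# Usage (placeholder corpora):
# print(shared_concepts(migraine_abstracts, magnesium_abstracts))
```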
Slide 15: When is Open Source Successful?
- "Important" problem
- Adaptation:
  - A little adaptation is easy
  - Most users do not need any adaptation (out-of-the-box use)
- Incremental releases are useful
- Cost sharing without administrative/legal overhead
- Many users (operating systems)
- Fun to work on (games)
- Public funding available (OpenBSD, security)
- Open source authors gain fame/satisfaction/immortality/community
- Dozens of companies have a significant interest in Linux (IBM, …)
  - Many of these companies contribute to open source
  - This is in effect an informal consortium
  - A formal effort probably would have killed Linux
  - Does the same apply to text mining?
- Also: bugs, security, high availability; ideal for consulting & hardware companies like IBM

Slide 16: When is Open Source Not Successful?
- Boring & rare problems: print driver for a 10-year-old printer
- Complex integrated solutions: QuarkXPress, ERP systems
- Good UI experience for non-geeks: Apple, Microsoft Windows (at least for now)

Slide 17: Text Mining and Open Source
- Pro:
  - Important problem: fame, satisfaction, immortality, community can be gained
  - Pooling of resources / critical mass
- Con:
  - Non-incremental?
  - Most text mining requires significant adaptation.
  - Most text mining requires data resources as well as source code. The need for data resources does not fit well into the open source paradigm.

Slide 18: Text Mining Open Source Today
- Lucene: excellent for information retrieval, but not much text mining
- Rain/bow, Weka, GTP, TDMAPI: text mining algorithms / infrastructure, no data resources
- NLTK: NLP toolkit, some data resources
- WordNet, DMOZ: excellent data resources, but not enough breadth/depth

Slide 19: Open Source with Open Data
- Spell checkers (e.g., Emacs)
- Antispam software (e.g., SpamAssassin)
- Named entity recognition (GATE/ANNIE): the free version is less powerful than the in-house one

Slide 20: SpamAssassin: Code + Data
[Screenshot]

Slide 21: Open Data Resources: Examples
- SpamAssassin: classification model for spam
- Named entity recognition: word lists, dictionaries
- Information extraction: domain models, taxonomies, regular expressions
- Shallow parsing: grammars

Slide 22: Code vs. Data
[Chart with two axes: data resources needed (none → significant) and code (proprietary vs. open source). Listed items, roughly from most to least resource-hungry: text classification, named entity recognition, information extraction, "?", spam filtering, spell checkers; on the proprietary side: complex & integrated software, good UI design; on the open source side: Linux, web servers.]

Slide 23: Open Source with Data: Key Issues
- Can data resources be recycled?
  - Assume there is a large library of data resources available.
  - Problems have to be similar.
  - More difficult than one would expect: my first attempt failed (Medline/Reuters).
  - Next: a case study
- How do we identify the data resources that can be recycled?
- How do we adapt them?
- How do we get from here to there? We need an incremental approach that is sustained by successes along the way.

Slide 24: Text Mining without Data Resources
- Premise: "knowledge-poor" text mining taps only a small part of the potential of text mining.
- Knowledge-poor text mining examples: clustering, phrase extraction, first story detection
- Many success stories

Slide 25: Case Study: ODP -> Reuters
- Train on ODP
- Apply to Reuters

Slide 26: Case Study: Text Classification
- Key issues for text classification:
  - Show that text classifiers can be recycled
  - How can we select reusable classifiers for a particular task?
  - How do we adapt them?
- Case study:
  - Train classifiers on the Open Directory (ODP): 165,000 docs (nodes), crawled in 2000, 505 classes
  - Apply classifiers to Reuters RCV1: 780,000 docs, >1000 classes
- Hypothesis: a library of classifiers based on ODP can be recycled for RCV1.

Slide 27: Experimental Setup
- Train 505 classifiers on ODP.
- Apply them to Reuters.
- Compute chi-square for all ODP x Reuters pairs.
- Evaluate the n pairs with the best chi-square values (see the sketch below).
- Evaluation measures:
  - Area under the ROC curve: plot the false positive rate vs. the true positive rate and compute the area under the curve.
  - Average precision: rank documents, compute precision at each rank, and average over all positive documents; estimated on a 25% sample.
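A minimal sketch of the setup on slides 26-27, assuming the two corpora are already loaded as odp_texts/odp_labels (one ODP class per document) and rcv1_texts/rcv1_topics (a set of topic codes per document). The use of scikit-learn, TF-IDF features, logistic regression, and a 0.5 decision threshold is an illustrative assumption, not the setup used in the talk.

```python
# Sketch of the recycling experiment: train one classifier per ODP class,
# apply all of them to RCV1, pair classifiers with RCV1 topics by chi-square,
# and report ROC area / average precision for the best pairs.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score


def train_odp_classifiers(odp_texts, odp_labels, vectorizer):
    """One binary (one-vs-rest) classifier per ODP class."""
    X = vectorizer.fit_transform(odp_texts)
    classifiers = {}
    for cls in sorted(set(odp_labels)):
        y = np.array([1 if label == cls else 0 for label in odp_labels])
        classifiers[cls] = LogisticRegression(max_iter=1000).fit(X, y)
    return classifiers


def match_and_evaluate(classifiers, vectorizer, rcv1_texts, rcv1_topics, top_n=20):
    """Pair ODP classifiers with RCV1 topics by chi-square, then evaluate."""
    X = vectorizer.transform(rcv1_texts)
    all_topics = sorted({t for topics in rcv1_topics for t in topics})
    results = []
    for odp_cls, clf in classifiers.items():
        scores = clf.predict_proba(X)[:, 1]
        preds = scores >= 0.5
        for topic in all_topics:
            truth = np.array([topic in topics for topics in rcv1_topics])
            if truth.sum() == 0:
                continue
            # 2x2 contingency table: classifier decision vs. topic label.
            table = np.array([[np.sum(preds & truth), np.sum(preds & ~truth)],
                              [np.sum(~preds & truth), np.sum(~preds & ~truth)]])
            if table.sum(axis=0).min() == 0 or table.sum(axis=1).min() == 0:
                continue  # chi-square undefined for empty rows/columns
            chi2, _, _, _ = chi2_contingency(table)
            # The talk estimated average precision on a 25% sample;
            # here it is computed on all documents for simplicity.
            results.append((chi2, odp_cls, topic,
                            roc_auc_score(truth, scores),
                            average_precision_score(truth, scores)))
    results.sort(reverse=True)
    for chi2, odp_cls, topic, auc, ap in results[:top_n]:
        print(f"{odp_cls:20s} {topic:10s} ROC area {auc:.2f}  avg. prec. {ap:.2f}")
```

With ODP as the training side and RCV1 as the target, the printed pairs would correspond to rows like those in the table on slide 44.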
Slide 28: Japan: ODP -> Reuters
[ROC curve for the Japan classifier trained on ODP and applied to Reuters: true positive rate vs. false positive rate]

Slide 29: Some Results
[Results figure]

Slide 30: BusIndTraMar0 / I76300: Ports
[Figure]

Slide 31: Discussion
- Promising results
- These are results without any adaptation.
- Performance is expected to be much better after adaptation.

Slide 32: Discussion (cont.)
- Class relationships are m:n, not 1:1.
- Reuters GSPO matches several ODP classes: SpoBasCol0, SpoBasMinLea0, SpoBasReg0, SpoHocIceLeaNatPla0, SpoHocIceLeaPro0
- ODP RegEurUniBusInd0 (UK industries) matches several Reuters codes: I13000 (petroleum & natural gas), I17000 (water supply), I32000 (mechanical engineering), I66100 (restaurants, cafes, fast food), I79020 (telecommunications), I9741105 (radio broadcasting)

Slide 33: Why Recycling Classifiers is Difficult
- Autonomous vs. relative decisions: the ODP Japan classifier without modifications has high precision, but only 1% recall on RCV1!
- Most classifiers are tuned for optimal performance in an embedded system; tuning decreases robustness in recycling.
- Tokenization, document length, numbers:
  - Numbers throw off a Medline vs. non-Medline categorizer (financial documents classified as medical).
  - A length-sensitive multinomial Naive Bayes gives nonsensical results.

Slide 34: Specifics
- What would an open source text classification package look like?
- Code:
  - Text mining algorithms
  - Customization component, to adapt recycled data resources
  - Creation component, to create new data resources
- Data:
  - Recycled data resources
  - Newly created data resources
- Pick a good area: bioinformatics (genes/proteins), product catalogs

Slide 35: Other Text Mining Areas
- Named entity recognition
- Information extraction
- Shallow parsing

Slide 36: Data vs. Code
- What about just sharing training sets? Often proprietary.
- What about just sharing models? Small preprocessing changes can throw you off completely.
- Share a (simple?) classifier cum preprocessor together with models? Still proprietary issues.

Slide 37: Open Source & Data
[Diagram of the release cycle: Public Code+Data V1.0 is adapted (proprietary) into Enhanced Code+Data, which is sanitized into Sanitized & Enhanced Code+Data and published as a new release, Public Code+Data V1.1]

Slide 38: Free Riders?
- Open source is successful because it makes free riding hard (the viral nature of the GPL).
- This is harder to achieve for some data resources: download the models, apply them to your data, retrain, and you own 100% of the result.
- Less of a problem for dictionaries and grammars.

Slide 39: Data Licenses
- Open Directory license (http://rdf.dmoz.org/license.html): BSD flavor
- WordNet (http://www.cogsci.princeton.edu/~wn/license.shtml):
  - Copyright; no license to sell derivative works?
  - Some criteria for derivative works: substantially similar (Seinfeld trivia); potential damage to future marketing of derivative works

Slide 40: Code vs. Data Licenses
- Some similarity:
  - If I open-source my code, then I will benefit from bug fixes & enhancements written by others.
  - If I open-source my data resource, then my classification model may become more robust due to improvements made by others.
- Some dissimilarity:
  - Code is very abstract: few issues with proprietary information creeping in.
  - Text mining resources are not very abstract: there is a potential for sensitive information to creep in.

Slide 41: Areas in Need of Research
- How to identify reusable text mining components
- How to adapt reusable text mining components (see the sketch below):
  - Active learning
  - Interactive parameter tweaking?
  - Combination of a recycled classifier and new training information
- Estimating performance:
  - The ODP/Reuters case study does not address this.
  - Do we need a (small) labeled sample to be able to do this? Most estimation techniques require large labeled samples, but the point is to avoid constructing a large labeled sample.
- Create a viral license for data resources.
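One concrete reading of two of these research directions, active learning and combining a recycled classifier with new training information, is the uncertainty-sampling loop sketched below. It assumes a recycled model with predict_proba, a fitted vectorizer, and an ask_human oracle that returns a 0/1 label; the score-blending step at the end is a hypothetical adaptation strategy, not one proposed in the talk.

```python
# Sketch: adapt a recycled classifier to a new corpus with a handful of
# labels obtained by uncertainty sampling, then blend old and new models.
import numpy as np
from sklearn.linear_model import LogisticRegression


def adapt_recycled(recycled_clf, vectorizer, target_texts, ask_human,
                   budget=50, blend=0.5):
    X = vectorizer.transform(target_texts)
    old_scores = recycled_clf.predict_proba(X)[:, 1]

    labeled_idx, labels = [], []
    for _ in range(budget):
        # Query the document the recycled classifier is least sure about.
        uncertainty = -np.abs(old_scores - 0.5)
        uncertainty[labeled_idx] = -np.inf         # never ask about a doc twice
        i = int(np.argmax(uncertainty))
        labeled_idx.append(i)
        labels.append(ask_human(target_texts[i]))  # oracle returns 0 or 1

    # Train a small in-domain model on the newly labeled documents
    # (assumes the oracle produced at least one example of each class) ...
    new_clf = LogisticRegression(max_iter=1000).fit(X[labeled_idx], labels)
    new_scores = new_clf.predict_proba(X)[:, 1]

    # ... and combine recycled and new information by score averaging.
    return blend * old_scores + (1 - blend) * new_scores
```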
Slide 42: Summary
- Many interesting research issues
- Need an institution/individual to take the lead
- Need a motivated network of contributors: data resource contributors, source code contributors
- Start with a small & simple project that proves the idea
- If it works, text mining could become an enabler on a par with Linux.

Slide 43: More Slides

Slide 44
Results table. The two numbers per row are presumably ROC area and average precision, the measures defined on slide 27.

ODP classifier      Reuters code   ROC area   Avg. precision
RegAsiJap0          JAP            0.86       0.62
RegAsiPhi0          PHLNS          0.91       0.56
RegAsiIndSta0       INDIA          0.85       0.53
SpoSocPla0          CCAT           0.60       0.53
RegEurRus0          CCAT           0.58       0.51
RegEurRus0          RUSS           0.85       0.51
SpoSocPla0          GSPO           0.78       0.42
SpoBasReg0          GSPO           0.75       0.33
RegAsiIndSta0       MCAT           0.56       0.32
SpoBasPla1          GSPO           0.80       0.31
SpoBasCol0          GSPO           0.78       0.31
SpoBasCol1          GSPO           0.74       0.26
RegEurSlo0          SLVAK          0.86       0.25
SpoBasPla0          GSPO           0.77       0.24
RegEurRus0          MCAT           0.49       0.23
BusIndTraMar0       I76300         0.81       0.23
SpoHocIceLeaPro0    GSPO           0.71       0.20
SpoBasMinLea0       GSPO           0.71       0.20
RegMidLeb0          LEBAN          0.83       0.19
RecAvi0             I36400         0.74       0.18
RegSou0             BRAZ           0.84       0.18

Slide 45: Resources
- http://www-csli.stanford.edu/~schuetze (this talk, some additional material)
- Source of the Gates quote: http://www.techweb.com/wire/story/TWB19990324S0014
- Kurt D. Bollacker and Joydeep Ghosh. A scalable method for classifier knowledge reuse. In Proceedings of the 1997 International Conference on Neural Networks, pages 1474-1479, June 1997. (Proposes a measure for selecting classifiers for reuse.)
- W. Cohen and D. Kudenko. Transferring and retraining learned information filters. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI 97). (Transfer within the same dataset.)
- Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier architecture for scalable knowledge reuse. In The 1998 International Conference on Machine Learning, pages 64-72, July 1998. (Transfer within the same dataset.)
- Motivation of open source contributors: http://newsforge.com/newsforge/03/04/19/2128256.shtml?tid=11, http://cybernaut.com/modules.php?op=modload&name=News&file=article&sid=8&mode=thread&order=0&thold=0