Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
SIG New Grad 2010 Vijay Shanker General Info • Research interests: Natural Language Processing, Text Mining, Machine Learning, NLP/Software Engineering • Funding: NSF, NIH/NLM, USDA • Teaching: Logic, Theory of Computation, Machine Learning/Data Mining, Topics in Natural Language Processing Machine Learning with Minimal Supervision • Supervised Machine Learning requires annotated data – what should the output be for each instance. • More success leads to wider use, but – Lot of data to be annotated – Often requires expert annotator • Minimal Supervision – Active learning – Domain Adaptation – Semi-supervised or unsupervised Active Learning Michael Bloodgood (2009) • The learned model decides which instance should be annotated next. • Mike’s dissertation focused on active learning in a situation where there is data imbalance • Partially learned model (SVM) would figure out the hard instances • For Imbalance, adjusted model to penalize errors on minority class according to degree of imbalance • Statistical methods to infer the degree of imbalance with high confidence • Success in a variety of IR (e.g., text classification), NLP and other domains Active Learning Contd. • When do we stop? A fundamental but understudied research problem • When performance (accuracy is good enough) – but we can’t use annotated data to decide • Method based on stabilization of predictions • ICML workshop on active Learning (2008), HLT/NAACL (2009), CoNLL (2009) Domain Adaptation • ML/Statistical NLP methods not successful when application domain is different from training domain • Use same model but adapt to domain • Figure out what is independent of domain and what varies • Use unannotated data from domain to make a good guess and bootstrap with learned model • Example– POS tagging (J. Miller & M. Torii) – Syntax same but lexicon different (EMNLP 2008) Learning from Positive Data • Existence of incomplete databases (especially in Bioinformatics) • Databases only provide positive data • But can’t learn without negative data • Again issues involving bootstrapping, choosing examples, imbalance etc. Text Mining and Information Extraction • Mining – look for nuggets • Text Mining – look for information/patterns in large amounts of text • Information Extraction – look for specific kind of information from documents (text, websites, etc) • Relation Extraction – two or more entities in specific relation – from unstructured data to structured data (Databases) • Employer-Employee relation, 2 proteins that interact, books and authors • Much of my current work on biomedical research articles Text Mining eGIFT -- Oana Tudor • Large scale experimental data involve thousands of genes • Tens of thousands of genes. Biologists might know about a couple of hundred genes well • Incomplete databases • Still most of information in the literature only, besides the papers give the relevant context to interpret information – but significant IR issues in search for genes eGIFT • Addresses gene based search (IR) • Compares gene specific literature vs background set to identify “key terms” • Categorizes terms • Links to sentences and literature • Picking one good sentence to suggest relation • Summary of gene and its properties DNA Topoisomerase II alpha • Identify “keywords” without any human intervention • Top2a literature -- 758 abstracts – nuclear (0.3), break (0.1), corepressor (0.003) • Background -- 1.98 million abstracts – nuclear (0.1), break (0.001), corepressor (0.05) • Statistical tests for significance • Extend or not? – Nuclear, double strand break, anticancer, chromosome segregation, … Identify Sentences • Pick sentences with keywords based on where they appear, where gene name appears, # of keywords etc. – DNA Topoisomerase IIA expression levels are related to tumor growth and to resistance to anticancer chemotherapy – DNA Topoisomerase IIalpha is an essential enzyme for chromosome segregation during mitosis – DNA topoisomerase II is a nuclear enzyme whose decatenating activity on newly replicated DNA is essential to successful cell division • SMBM 2008, BMC Bioinformatics 2010 Current and Future Extensions • Selecting and Ranking sentences – SVM rank – Features – vagueness and sentence complexity • Summarizing a gene and its properties – wikigene • Assist in interpretation of large scale experiments, hypothesis generation and knowledge discovery Information Extraction • Several relation extraction tasks, many from biomedical text • Research issues are not specific to any domain or task – Beyond sentences – What kind of features (lexical, syntactic, semantic features) help? – Learn starting from positive data – Could sentence simplification help? Text Simplification for RE • Sentences in WSJ, paper abstracts, etc. are quite structurally complicated • To humans the information to be extracted may look straightforward • But not straightforward for systems, e.g., because parsing (especially in new domains) remains a hard task. So often these systems rely on much “shallower” features. Simplification Example • MAPK phosphorylates BCL-2 at Serine 112. • MAPK isolated from rat brain tissue phosphorylates BCL-2 at Serine 112. • MAPK phosphorylates BCL-2, a member of the BAX family, at Serine 112. • MAPK phosphorylates BCL-2 at Serine 112 and BAD at Tyrosine 25. Sentence Simplification • MAPK isolated from rat brain tissue phosphorylates BCL-2 at Serine 112. – MAPK phosphorylates BCL-2 at Serine 112. – MAPK was isolated from rat brain tissue. • MAPK phosphorylates BCL-2 at Serine 112 and BAD at Tyrosine 25. – MAPK phosphorylates BCL-2 at Serine 112. – MAPK phosphorylates BAD at Tyrosine 25. NLPA for SE • Collaboration with Lori Pollock • Code as a different type of NL document • Abbreviation expansion -- wcStrBrk – Acronyms – International Business Machine (IBM) vs AuctionEntry ae; • Importance of verb object relation – NL – verb classification e.g., drink, sip, gulp vs. eat, bite, … – No subject; object in name or first parameter Software Word Usage Model • SWUM – virtual remodularization (OO vs procedural) – where in the code is some action completed – Software search -- query expansion (Shepherd etal AOSD 2006), query reformulation (Hill etal ICSE 2009) – Software navigation – Hill et al ASE 2009 – Comment generation – Giri et al ASE 2010, ICSE 2011 Summary Comment Generation • Giriprasad ASE 2010 • NL vs software code – Location in document – differing importance – Flow of information – lexical chains vs control and data flow • High level actions (ICSE 2011) – Structural aspects, code conventions and design patterns • Text Segmentation vs code segmentation – Readability (para), refactoring code, internal comments Naming Conventions Classes and methods • Call site – imperative statements vs. comment – – – – saveAuction(ae) savesAuction() or c.contains(str) auctionSaved() savedAuction • Boolean returns – proposition and sentential window.isClosed() • Verbs to nouns. Tyranny of nouns. – Traditional vs action-oriented classes – book, sequenceManager, xmlParser Other Projects • Colin Kern – with Prof. Liao – Transmembrane protein structure prediction – Incorporating local and long distance relations • Dan Blanchard – with Prof. Jeffrey Heinz – Unsupervised word segmentation (e.g., speech) – What if we didn’t have a lexicon? Infant language acquisition