Julius Information Extractor
June 14, 2006
Kyle Woodward, Lee-Ming Zen

The Problem
• There is a lot of text and information out there, but not a whole lot of tagging. How can we extract information a user is interested in without knowing anything beforehand?

Approach
• Based upon AT&T system
• Build up “spelling” and “context” rules
• Iteratively learn new rules: label with one set of rules, examine the labels, then jump to the other set of rules
• Additional features
  • We used a fixed-length prefix and suffix to augment the context
  • Used POS tags instead of a full grammar parse for context
  • Window bounds selection to determine tag size
• Web
  • Use information from web search snippets

Rules
• A rule is a set of features for a particular labeling, with a weight for each feature
  • e.g. allcap, contains, full-string

What’s Cool
• Generality
  • No restrictions on the type of data it runs against
  • No preassumed notions about the domain
• GUI tools
  • Labeler
  • Statistics viewer
• Works
  • Works well on small data sets

What’s Not
• Fails on larger corpora
• Generality tradeoff means not being able to exploit certain information
• Web context does not necessarily help due to noise
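
Sketch: spelling/context rule bootstrapping
• The Approach and Rules slides can be illustrated with a small Python sketch. It is not the Julius code. It assumes rules map a feature to per-label weights, that weights are simple precision estimates, and that context features are plain strings such as "prev=said". The feature names allcap, contains, and full-string come from the slides; every function name, the label names, the 0.9 threshold, and the toy data are hypothetical.

```python
from collections import defaultdict


def spelling_features(phrase):
    """Spelling features named on the Rules slide: full-string, allcap, contains."""
    feats = {"full-string=" + phrase}
    if phrase.isupper():
        feats.add("allcap")
    for token in phrase.split():
        feats.add("contains=" + token)
    return feats


def classify(features, rules):
    """Sum rule weights per label; return the best label, or None if no rule fires."""
    totals = defaultdict(float)
    for feat in features:
        for label, weight in rules.get(feat, {}).items():
            totals[label] += weight
    if not totals:
        return None
    return max(totals, key=totals.get)


def induce_rules(feature_sets, labels, min_precision=0.9):
    """Assumed learner: keep a feature as a rule if it predicts one label precisely."""
    counts = defaultdict(lambda: defaultdict(int))
    for feats, label in zip(feature_sets, labels):
        if label is None:
            continue  # unlabeled examples contribute nothing
        for feat in feats:
            counts[feat][label] += 1
    rules = {}
    for feat, by_label in counts.items():
        total = sum(by_label.values())
        label, hits = max(by_label.items(), key=lambda kv: kv[1])
        if hits / total >= min_precision:
            rules[feat] = {label: hits / total}
    return rules


def bootstrap(examples, seed_rules, rounds=3):
    """Alternate between the two rule sets; examples are (spelling, context) pairs."""
    spelling = [s for s, _ in examples]
    context = [c for _, c in examples]
    spelling_rules = dict(seed_rules)
    context_rules = {}
    for _ in range(rounds):
        # Label with spelling rules, then learn context rules from those labels.
        labels = [classify(f, spelling_rules) for f in spelling]
        context_rules = induce_rules(context, labels)
        # Jump to the context rules, relabel, and learn new spelling rules.
        labels = [classify(f, context_rules) for f in context]
        learned = induce_rules(spelling, labels)
        spelling_rules = {**learned, **seed_rules}  # seeds are never overwritten
    return spelling_rules, context_rules


# Hypothetical seed rules and toy data, for illustration only.
SEEDS = {"contains=Mr.": {"PERSON": 1.0}, "contains=Inc.": {"ORG": 1.0}}
EXAMPLES = [
    (spelling_features("Mr. Smith"), {"prev=said"}),
    (spelling_features("Jones"), {"prev=said"}),
    (spelling_features("Acme Inc."), {"next=shares"}),
    (spelling_features("Globex"), {"next=shares"}),
]

if __name__ == "__main__":
    spelling_rules, context_rules = bootstrap(EXAMPLES, SEEDS)
    print(classify(spelling_features("Globex"), spelling_rules))
```

• In this toy run the seeds label only "Mr. Smith" and "Acme Inc."; the context rules learned from them then label "Jones" and "Globex", which is the jump from one set of rules to the other that the Approach slide describes.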
