CSA2050: Chart Parsing with NLTK

NLTK
• A software package for manipulating linguistic data and performing NLP tasks
• Advanced tasks are possible from an early stage
• Permits projects at various levels
• Consistent interfaces
• Facilitates reusability of modules
• Implemented in Python

Why Python
• Popular languages for NLP courses:
  – Prolog (clean, but a steep learning curve and slow)
  – Perl (quick, but awkward syntax)
• Why Python is better suited:
  – Easy to learn, with a clean syntax
  – Interpreted, supporting rapid prototyping
  – Object oriented
  – Powerful

NLTK Structure
• NLTK is implemented as a set of minimally interdependent modules.
• Core modules
  – Basic data types
• Task modules
  – Tokenising
  – Parsing
  – Other NLP tasks

Token Class
• The Token class encodes information about natural-language texts.
• Each Token instance represents a unit of text, such as a word, a sentence, or a document.
• A given instance is defined by a partial mapping from property names to property values.

The TEXT Property
• The TEXT property is used to encode a token's text content:

  >>> from nltk.token import *
  >>> Token(TEXT="Hello World!")
  <Hello World!>

The TAG Property
• The TAG property is used to encode a token's part-of-speech tag:

  >>> Token(TEXT="python", TAG="NN")
  <python/NN>

The SUBTOKENS Property
• The SUBTOKENS property is used to store a tokenized text:

  >>> from nltk.tokenizer import *
  >>> tok = Token(TEXT="Hello World!")
  >>> WhitespaceTokenizer().tokenize(tok)
  >>> print tok['SUBTOKENS']
  [<Hello>, <World!>]

Augmenting the Token with Information
• Language processing tasks are formulated as annotations and transformations that add properties to the Token data structure, for example:
  – word-sense disambiguation
  – chunking
  – parsing

Blackboard Architecture
• Typically these modifications are monotonic: they add information but do not delete it.
• Tokens serve as a blackboard where information about a piece of text is collated.
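The TEXT/TAG/SUBTOKENS examples above use a historical NLTK API that may not match current releases. As a minimal, self-contained sketch of the underlying idea only — a token as a partial mapping from property names to values, annotated monotonically by each processing stage — one could write (function names here are illustrative, not NLTK's):

```python
# Sketch of the Token-as-blackboard idea (NOT the real NLTK API):
# a token is just a partial mapping from property names to property values,
# and each processing step monotonically adds properties.

def make_token(**properties):
    """Create a token as a dict of property names to values."""
    return dict(properties)

def whitespace_tokenize(tok):
    """Add a SUBTOKENS property by splitting TEXT on whitespace."""
    tok["SUBTOKENS"] = [make_token(TEXT=w) for w in tok["TEXT"].split()]

tok = make_token(TEXT="Hello World!")
whitespace_tokenize(tok)  # adds SUBTOKENS; TEXT is left untouched
print([t["TEXT"] for t in tok["SUBTOKENS"]])  # ['Hello', 'World!']

# A later stage (e.g. a tagger) adds further properties without deleting any:
tok["SUBTOKENS"][0]["TAG"] = "UH"  # hypothetical tag, for illustration only
```

Note how the tokenizer and the tagger each only add properties — the monotonic, blackboard-style discipline described above — rather than replacing the input as a destructive pipeline stage would.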
• This architecture contrasts with the more typical pipeline architecture, in which each stage destructively modifies the input information.
• This approach was chosen because it gives greater flexibility when combining tasks into a single system.

Other Core Modules
• The probability module defines classes for probability distributions and statistical smoothing techniques.
• The cfg module defines classes for encoding context-free grammars (normal and probabilistic).
• The corpus module defines classes for reading and processing different corpora.

Using the Brown Corpus

  >>> from nltk.corpus import brown
  >>> brown.groups()
  ['skill and hobbies', 'popular lore', 'humor', 'fiction: mystery', ...]
  >>> brown.items('humor')
  ('cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09')
  >>> brown.tokenize('cr01')
  <[<It/pps>, <was/bedz>, <among/in>, <these/dts>, <that/cs>, <Hinkle/np>, <identified/vbd>, <a/at>, ...]>

Penn Treebank

  >>> from nltk.corpus import treebank
  >>> treebank.groups()
  ('raw', 'tagged', 'parsed', 'merged')
  >>> treebank.items('parsed')
  ['wsj_0001.prd', 'wsj_0002.prd', ...]
  >>> item = 'parsed/wsj_0001.prd'
  >>> sentences = treebank.tokenize(item)
  >>> for sent in sentences['SUBTOKENS']:
  ...     print sent.pp()   # pretty-print
  (S: (NP-SBJ: (NP: <Pierre> <Vinken>) (ADJP: (NP: <61> <years>) <old>) ...

Processing Modules
• Each language processing algorithm is implemented as a class.
• For example, the ChartParser and RecursiveDescentParser classes each define a single algorithm for parsing a text.
• Each processing module defines an interface.
• Interface classes are named with a trailing capital I, e.g. ParserI.
• Such interface classes define one or more action methods that perform the task the module is supposed to perform, e.g. a parse method and a parse_n method.

What is Python
• Python is an interpreted, object-oriented programming language with dynamic semantics.
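The interface convention just described (an interface class with a trailing capital I declaring action methods, implemented by concrete algorithm classes) can be sketched as follows. This is an illustration of the pattern only — ParserI's real method signatures differ across NLTK versions, and EchoParser is a made-up toy class, not part of NLTK:

```python
# Sketch of NLTK's interface convention: an interface class named with a
# trailing capital I declares the action methods; concrete parsers subclass it.

class ParserI:
    """Interface: anything that can parse a text."""
    def parse(self, tok):
        """Return the best parse for tok."""
        raise NotImplementedError
    def parse_n(self, tok, n):
        """Return up to n candidate parses for tok."""
        raise NotImplementedError

class EchoParser(ParserI):
    """Toy 'parser' (hypothetical) that wraps the words in a flat tree."""
    def parse(self, tok):
        return ("S",) + tuple(tok["TEXT"].split())
    def parse_n(self, tok, n):
        return [self.parse(tok)][:n]

p = EchoParser()
print(p.parse({"TEXT": "Pierre Vinken is old"}))
# ('S', 'Pierre', 'Vinken', 'is', 'old')
```

Because ChartParser and RecursiveDescentParser would both implement the same interface, client code written against ParserI can swap one parsing algorithm for another without change.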
• Attractive for Rapid Application Development.
• Its easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
• Python supports modules and packages, which encourages program modularity and code reuse.
• Developed by Guido van Rossum in the early 1990s.
• Named after Monty Python.
• Open source and free.
• Download from www.python.org

Why Python
• Prolog: clean, but a steep learning curve and slow
• Lisp: old, awkward syntax, big
• Perl: quick
• C#