Statistical Natural Language Processing
Advanced AI - Part II
Luc De Raedt
University of Freiburg
WS 2005/2006
Many slides taken from Helmut Schmid

Topic
• Statistical Natural Language Processing applies Machine Learning / Statistics to Natural Language Processing
  • Learning: the ability to improve one’s behaviour at a specific task over time; involves the analysis of data (statistics)
• Following parts of the book Statistical NLP (Manning and Schuetze), MIT Press, 1999

Rationalism versus Empiricism
Rationalist
  • Noam Chomsky: innate language structures
  • AI: hand coding NLP
  • Dominant view 1960-1985
  • Cf. e.g. Steven Pinker’s The Language Instinct (popular science book)
Empiricist
  • Ability to learn is innate
  • AI: language is learned from corpora
  • Dominant 1920-1960 and becoming increasingly important

Rationalism versus Empiricism
Noam Chomsky:
  “But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”
Fred Jelinek (IBM 1988):
  “Every time a linguist leaves the room the recognition rate goes up.”
  (Alternative: “Every time I fire a linguist the recognizer improves.”)

This course
• Empiricist approach
  • Focus will be on probabilistic models for learning of natural language
• No time to treat natural language in depth! (though this would be quite useful and interesting)
  • Deserves a full course by itself
  • Covered in more depth in Logic, Language and Learning (SS 05, prob. SS 06)

Ambiguity

NLP and Statistics
Statistical Disambiguation
• Define a probability model for the data
• Compute the probability of each alternative
• Choose the most likely alternative

NLP and Statistics
• Statistical methods deal with uncertainty. They predict the future behaviour of a system based on the behaviour observed in the past.
• Statistical methods require training data.
• The data in Statistical NLP are the corpora.

Corpora
Corpus: text collection for linguistic purposes
Tokens: How many words are contained in Tom Sawyer?
  71,370
Types: How many different words are contained in T.S.?
  8,018
Hapax Legomena: words appearing only once

Word Counts
  word   freq      word   freq
  the    3332      in      906
  and    2972      that    877
  a      1775      he      877
  to     1725      I       783
  of     1440      his     772
  was    1161      you     686
  it     1027      Tom     679
The most frequent words are function words.

Word Counts
How many words appear f times?
  f         n_f
  1         3993
  2         1292
  3          664
  4          410
  5          243
  6          199
  7          172
  8          131
  9           82
  10          91
  11-50      540
  51-100      99
  > 100      102
• About half of the words occur just once
• About half of the text consists of the 100 most common words
• ….

Word Counts (Brown corpus)
[figure not preserved]

Zipf’s Law
  word       f      r     f*r
  the        3332   1     3332
  and        2972   2     5944
  a          1775   3     5235
  he          877   10    8770
  but         410   20    8400
  be          294   30    8820
  there       222   40    8880
  one         172   50    8600
  about       158   60    9480
  more        138   70    9660
  never       124   80    9920
  Oh          116   90   10440
  two         104   100  10400

Zipf’s Law: f ~ 1/r (f*r = const)
  word        f    r      f*r
  turned      51   200   10200
  you’ll      30   300    9000
  name        21   400    8400
  comes       16   500    8000
  group       13   600    7800
  lead        11   700    7700
  friends     10   800    8000
  begin        9   900    8100
  family       8   1000   8000
  brushed      4   2000   8000
  sins         2   3000   6000
  Could        2   4000   8000
  Applausive   1   8000   8000
Minimize effort

Some probabilistic models
• N-grams
  • Predicting the next word
  • Artificial intelligence and machine ….
  • Statistical natural language ….
• Probabilistic
  • Regular (Markov Models)
  • Hidden Markov Models
  • Conditional Random Fields
  • Context-free grammars
  • (Stochastic) Definite Clause Grammars

Illustration
• Wall Street Journal Corpus
  • 3 000 000 words
• Correct parse tree for sentences known
  • Constructed by hand
• Can be used to derive stochastic context-free grammars (SCFGs)
• SCFGs assign probabilities to parse trees
• Compute the most probable parse tree

Conclusions
• Overview of some probabilistic and machine learning methods for NLP
• Also very relevant to bioinformatics!
  • Analogy between parsing
    • A sentence
    • A biological string (DNA, protein, mRNA, …)
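
The corpus statistics from the Corpora, Word Counts and Zipf’s Law slides (tokens, types, hapax legomena, the number of words n_f that appear f times, and the product f*r) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `tokenize`, `corpus_stats` and `zipf_table` and the sample sentence are made up for this example, and the regex tokenizer is a crude simplification that would not necessarily reproduce the Tom Sawyer figures quoted in the slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def corpus_stats(tokens):
    """Return (number of tokens, number of types, hapax legomena, frequency of frequencies)."""
    freq = Counter(tokens)                      # word -> f
    hapax = sorted(w for w, f in freq.items() if f == 1)
    freq_of_freq = Counter(freq.values())       # f -> n_f (how many types occur f times)
    return len(tokens), len(freq), hapax, freq_of_freq


def zipf_table(tokens):
    """Return rows (word, f, r, f*r), ranked by descending frequency.

    Ties are broken alphabetically, so every word gets its own rank r.
    """
    freq = Counter(tokens)
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, f, r, f * r) for r, (w, f) in enumerate(ranked, start=1)]


# Hypothetical toy corpus; a real experiment would read a full text instead.
sample = "the cat saw the dog and the dog saw a bird"
tokens = tokenize(sample)
n_tokens, n_types, hapax, freq_of_freq = corpus_stats(tokens)
table = zipf_table(tokens)

print("tokens:", n_tokens, "types:", n_types)
print("hapax legomena:", hapax)
for word, f, r, fr in table:
    print(f"{word:<6} f={f} r={r} f*r={fr}")
```

On a corpus this small the product f*r is of course not constant; Zipf’s law only becomes visible on a text of realistic size, such as a whole novel.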