SIMS 296a-4 Text Data Mining
Marti Hearst
UC Berkeley SIMS

The Textbook
Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schuetze
We’ll go through one chapter each week

Chapters to be Covered
1. Introduction (this week)
2. Linguistic Essentials
3. Mathematical Foundations
4. Mathematical Foundations (cont.)
5. Collocations
6. Statistical Inference
7. Word Sense Disambiguation
8. Markov Models
9. Text Categorization
10. Topics in Information Retrieval
11. Clustering
12. Lexical Acquisition

Introduction
Scientific basis for this inquiry
Rationalist vs. empirical approach to language analysis
– Justification for the rationalist view: poverty of the stimulus
– This can be overcome if we assume humans can generalize concepts

Introduction
Competence vs. performance theory of grammar
– Focus on whether or not sentences are well-formed
– Syntactic vs. semantic well-formedness
– Conventionality of expression breaks this notion

Introduction
Categorical perception
– Works pretty well for recognizing phonemes
– But not for larger phenomena like syntax
– Language change as counter-evidence to strict categorizability of language
  » kind of / sort of changed parts of speech very gradually
  » They occupied an intermediate syntactic status during the transition
– Better to adopt a probabilistic view (of cognition as well as of language)

Introduction
The ambiguity of language
– Unlike programming languages, natural language is ambiguous unless it is understood in terms of all its parts
  » Sometimes it is truly ambiguous too
– Parsing with syntax alone is harder than parsing with the underlying meaning as well

Classifying Application Types

                    Patterns                    Non-Novel Nuggets       Novel Nuggets
  Non-textual data  Standard data mining        Database queries        ?
  Textual data      Computational linguistics   Information retrieval   Real text data mining

Word Token Distribution
Word tokens are not uniformly distributed in text
– The most common tokens account for about 50% of the occurrences
– About 50% of the tokens occur only once
– ~12% of the text consists of words occurring 3 times or fewer
Thus it is hard to predict the behavior of many words in the text.

Zipf’s “Law”
Rank (r) = position of a word when words are ordered by frequency of occurrence
The product of a word’s frequency (f) and its rank (r) is approximately constant:
  f ∝ 1/r, i.e. f ≈ C/r, with C ≈ N/10 for a corpus of N tokens
(A small counting sketch follows the Zipf slides below.)
[Figure: histogram of word frequencies by bin]

Consequences of Zipf
There are always a few very frequent tokens that are not good discriminators.
– Called “stop words” in Information Retrieval
– Usually correspond to the linguistic notion of “closed-class” words
  » English examples: to, from, on, and, the, ...
  » Grammatical classes that don’t take on new members
Medium-frequency words are typically the most descriptive. A corpus typically has:
– A few very common words
– A middling number of medium-frequency words
– A large number of very infrequent words

Word Frequency vs. Resolving Power (from van Rijsbergen 79)
The most frequent words are not (usually) the most descriptive.

Order by Rank vs. by Alphabetical Order

Other Zipfian “Laws”
Conservation of speaker/hearer effort ->
– The number of meanings (m) of a word is correlated with its frequency
  » Pure speaker economy would yield one word for all meanings; pure hearer economy would yield one meaning per word
– m ∝ sqrt(f) (equivalently, m is inversely proportional to sqrt(r))
– Important for word sense disambiguation
Content words tend to clump together
– Important for computing term distribution models
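The Zipf and word-token-distribution claims above are easy to check empirically. Below is a minimal sketch, assuming a local plain-text file named corpus.txt and a crude letters-only tokenizer; both are illustrative choices, not part of the course materials.

# zipf_check.py -- sketch: check Zipf's law and the token distribution on a corpus.
# The file name "corpus.txt" and the letters-only tokenizer are illustrative assumptions.
import re
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z]+", text)       # crude tokenizer: runs of letters
counts = Counter(tokens)
ranked = counts.most_common()              # (word, frequency) pairs, most frequent first
N = len(tokens)                            # total number of word occurrences

# Zipf: frequency * rank should stay roughly constant (on the order of N/10).
for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        word, f = ranked[rank - 1]
        print(f"rank {rank:>4}  {word:<15} f = {f:<8} f*r = {f * rank}")

# Token distribution: a handful of types cover about half of the text,
# while roughly half of the types occur only once.
covered, top = 0, 0
for _, f in ranked:
    covered += f
    top += 1
    if covered >= N / 2:
        break
singletons = sum(1 for f in counts.values() if f == 1)
print(f"{top} most frequent types cover half of the {N} tokens")
print(f"{singletons / len(counts):.0%} of the {len(counts)} distinct types occur only once")

On a typical English corpus the f*r products at different ranks land in the same ballpark rather than being exactly equal, which is the sense in which the “law” holds only approximately.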
Is Zipf a Red Herring?
Power laws are common in natural systems.
Li (1992) shows that a Zipfian distribution of words can be generated randomly:
– An alphabet of 26 characters plus a blank (the word separator)
– The blank and every other character are equally likely to be generated
– Key insights:
  » There are 26 times more possible words of length n+1 than of length n
  » There is a constant ratio by which words of length n are more frequent than words of length n+1
Nevertheless, the Zipf insight is important to keep in mind when working with text corpora: language modeling is hard because most words are rare.
(A small simulation sketch appears at the end of these notes.)

Collocations
Collocation: any turn of phrase or accepted usage where the whole is perceived to have an existence beyond the sum of its parts.
– Compounds (disk drive)
– Phrasal verbs (make up)
– Stock phrases (bacon and eggs)
Another definition:
– The frequent use of a phrase as a fixed expression, accompanied by certain connotations.

Computing Collocations
Take the most frequent adjacent pairs
– By itself this doesn’t yield interesting results
– Need to normalize for the words’ frequencies within the corpus
Another tack: retain only pairs with interesting syntactic categories
  » adj noun
  » noun noun
More on this later! (A small counting sketch appears at the end of these notes.)

Next Week
Learn about linguistics!
Decide on project participation
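Following up on the “Is Zipf a Red Herring?” slide, here is a minimal simulation of Li’s random-text argument. It assumes an equiprobable alphabet of 26 letters plus a blank; the corpus length and random seed are arbitrary illustrative choices.

# li_random_text.py -- sketch: "monkey typing" text in the style of Li (1992).
# Each of the 26 letters and the blank is generated with equal probability;
# the words between blanks still show a roughly Zipfian rank-frequency curve,
# since there are 26 times more possible words at each longer length and a
# constant ratio by which shorter words are more frequent.
import random
from collections import Counter

random.seed(0)                                   # arbitrary seed, for repeatability
alphabet = "abcdefghijklmnopqrstuvwxyz "         # 26 letters plus a blank
chars = random.choices(alphabet, k=2_000_000)    # arbitrary corpus length
words = "".join(chars).split()
ranked = Counter(words).most_common()

for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        word, f = ranked[rank - 1]
        print(f"rank {rank:>4}  {word!r:<10} f = {f:<6} f*r = {f * rank}")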
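Following up on the “Computing Collocations” slide, here is a minimal sketch of the two steps it mentions: raw adjacent-pair counting, then a frequency-normalized re-ranking. Pointwise mutual information is used here as one possible normalization, not the course’s prescribed method; the file name, tokenizer, and minimum-count threshold are illustrative assumptions.

# collocations.py -- sketch: frequent adjacent pairs vs. frequency-normalized pairs.
import math
import re
from collections import Counter

tokens = re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower())
N = len(tokens)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))       # adjacent word pairs

def pmi(pair):
    # Pointwise mutual information: how much more often the pair occurs
    # than expected if its two words co-occurred independently.
    w1, w2 = pair
    return math.log2((bigrams[pair] / N) / ((unigrams[w1] / N) * (unigrams[w2] / N)))

print("Most frequent adjacent pairs (mostly pairs of stop words):")
for pair, f in bigrams.most_common(10):
    print(" ", pair, f)

print("Pairs re-ranked by PMI (minimum count 10 to avoid pure noise):")
candidates = [p for p, f in bigrams.items() if f >= 10]
for pair in sorted(candidates, key=pmi, reverse=True)[:10]:
    print(" ", pair, round(pmi(pair), 2))

The syntactic filter from the slide (keep only adj-noun and noun-noun pairs) would additionally require a part-of-speech tagger, which is beyond this sketch.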