SIMS 296a-4
Text Data Mining
Marti Hearst
UC Berkeley SIMS
The Textbook

Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schuetze

We’ll go through one chapter each week
Chapters to be Covered
1. Introduction (this week)
2. Linguistic Essentials
3. Mathematical Foundations
4. Mathematical Foundations (cont.)
5. Collocations
6. Statistical Inference
7. Word Sense Disambiguation
8. Markov Models
9. Text Categorization
10. Topics in Information Retrieval
11. Clustering
12. Lexical Acquisition
Introduction

Scientific basis for this inquiry: rationalist vs. empiricist approaches to language analysis
– Justification for the rationalist view: the poverty of the stimulus
– The empiricist view can overcome this if we assume humans can generalize concepts
Introduction

Competence vs. performance theories of grammar
– Focus on whether or not sentences are well-formed
– Syntactic vs. semantic well-formedness
– Conventionality of expression breaks this notion
Introduction

Categorical perception
– Recognizing phonemes works pretty well
– But it does not work for larger phenomena like syntax
– Language change is counter-evidence to strict categorizability of language
» “kind of”/“sort of” changed parts of speech very gradually
» They occupied an intermediate syntactic status during the transition
– Better to adopt a probabilistic view (of cognition as well as of language)
Introduction

The ambiguity of language
– Unlike programming languages, natural language is ambiguous unless understood in terms of all its parts
» Sometimes it is truly ambiguous too
– Parsing with syntax alone is harder than parsing with the underlying meaning as well
Classifying Application Types

                      Non-textual data        Textual data
  Patterns            Standard data mining    Computational linguistics
  Non-novel nuggets   Database queries        Information retrieval
  Novel nuggets       ?                       Real text data mining
Word Token Distribution

Word tokens are not uniformly distributed in text
– The most common tokens account for about 50% of the occurrences
– About 50% of the word types occur only once
– ~12% of the text consists of words occurring 3 times or fewer

Thus it is hard to predict the behavior of many words in the text.
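These proportions are easy to measure directly with a frequency table. The sketch below is illustrative: the `token_stats` helper and the toy sentence are my own, and the percentages quoted above come from a real corpus (such as the Tom Sawyer counts in Manning and Schuetze), which a twelve-word example cannot reproduce.

```python
from collections import Counter

def token_stats(tokens):
    """Summarize how unevenly word tokens are distributed in a text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Fraction of all token occurrences covered by the 10 most common words
    top10 = sum(f for _, f in counts.most_common(10)) / total
    # Fraction of word *types* occurring exactly once (hapax legomena)
    hapax = sum(1 for f in counts.values() if f == 1) / len(counts)
    # Fraction of the running text made up of words seen 3 times or fewer
    rare_mass = sum(f for f in counts.values() if f <= 3) / total
    return top10, hapax, rare_mass

# Toy corpus; run this on a real text to see proportions like those above.
text = "the cat sat on the mat and the dog sat by the door".split()
top10, hapax, rare = token_stats(text)
```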
Zipf’s “Law”

Rank (r) = the order of a word when all words are sorted by frequency of occurrence.

The product of a word’s frequency (f) and its rank (r) is approximately constant:

    f ≈ C · (1/r),  with C ≈ N/10 for a corpus of N tokens

[Histogram: word frequency by rank bin, falling steeply from about 350 toward 0]
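The relation can be checked directly from a table of counts. In this sketch the `rank_frequency` helper and the synthetic Zipfian counts are illustrative (real corpus counts only follow the law approximately); it ranks words by frequency and shows that the product f · r stays near the constant C:

```python
from collections import Counter

def rank_frequency(counts):
    """Return (rank, frequency, rank*frequency) triples, most frequent first."""
    freqs = sorted(counts.values(), reverse=True)
    return [(r, f, r * f) for r, f in enumerate(freqs, start=1)]

# Hypothetical counts constructed to follow Zipf's law, with C = 1200:
# the rank-r word occurs about C/r times.
C = 1200
counts = Counter({f"word{r}": C // r for r in range(1, 11)})
table = rank_frequency(counts)
# Each rank*frequency product stays close to C (off only by integer rounding).
```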
Consequences of Zipf

There are always a few very frequent tokens that are not good discriminators.
– Called “stop words” in Information Retrieval
– Usually correspond to the linguistic notion of “closed-class” words
» English examples: to, from, on, and, the, ...
» Grammatical classes that don’t take on new members

Typically a corpus contains:
– A few very common words
– A middling number of medium-frequency words
– A large number of very infrequent words

The medium-frequency words are the most descriptive.
Word Frequency vs. Resolving Power (from van Rijsbergen 79)

The most frequent words are not (usually) the most descriptive.

Order by Rank vs. by Alphabetical Order
Other Zipfian “Laws”

Conservation of speaker/hearer effort leads to:
– The number of meanings (m) of a word is correlated with its frequency
– (otherwise there would be either one word for all meanings or one meaning for every word)
– m is proportional to sqrt(f) — equivalently, inversely proportional to sqrt(rank)
– Important for word sense disambiguation

Content words tend to clump together
– Important for computing term distribution models
Is Zipf a Red Herring?

Power laws are common in natural systems.

Li (1992) shows that a Zipfian distribution of words can be generated randomly:
– 26 characters and a blank
– The blank, like every other character, is equally likely to be generated
– Key insights:
» There are 26 times more possible words of length n+1 than of length n
» There is a constant ratio by which words of length n are more frequent than words of length n+1

Nevertheless, the Zipf insight is important to keep in mind when working with text corpora: language modeling is hard because most words are rare.
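Li’s setup is simple enough to simulate. This is a sketch under the stated assumptions (the `random_text_words` helper, corpus size, and seed are my own choices): generate characters uniformly from 26 letters plus a blank, treat maximal letter runs as “words”, and observe that shorter words are far more frequent on average, since each extra letter makes any specific word 27 times less likely.

```python
import random
from collections import Counter
from string import ascii_lowercase

def random_text_words(n_chars, seed=0):
    """Li (1992)-style random text: 26 letters and a blank, all equally
    likely; the maximal runs of letters between blanks are the 'words'."""
    rng = random.Random(seed)
    alphabet = ascii_lowercase + " "
    text = "".join(rng.choice(alphabet) for _ in range(n_chars))
    return text.split()  # split() drops the empty runs between blanks

words = random_text_words(500_000)
counts = Counter(words)

# Average observed frequency of words, grouped by word length.
by_length = {}
for w, f in counts.items():
    by_length.setdefault(len(w), []).append(f)
avg = {n: sum(fs) / len(fs) for n, fs in by_length.items()}
# avg falls sharply with length: length-1 words are much more frequent
# per word than length-2 words, and so on.
```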
Collocations

Collocation: any turn of phrase or accepted usage where the whole is perceived to have an existence beyond the sum of its parts.
– Compounds (disk drive)
– Phrasal verbs (make up)
– Stock phrases (bacon and eggs)

Another definition:
– The frequent use of a phrase as a fixed expression accompanied by certain connotations.
Computing Collocations

Take the most frequent adjacent pairs
– By itself this doesn’t yield interesting results
– Need to normalize for each word’s frequency within the corpus

Another tack: retain only pairs with interesting syntactic categories
» adjective + noun
» noun + noun

More on this later!
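One standard way to do the normalization step is pointwise mutual information (PMI), which scores a pair by how much more often it occurs than the two words’ individual frequencies would predict; the sketch below uses PMI as an illustration (the `scored_bigrams` helper and toy sentence are mine, and the textbook covers this and other measures in the collocations chapter).

```python
import math
from collections import Counter

def scored_bigrams(tokens):
    """Count adjacent pairs and score them by PMI, normalizing each pair's
    raw count against the corpus frequency of its two words."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), f in bigrams.items():
        # Expected pair count if w1 and w2 occurred independently
        expected = unigrams[w1] * unigrams[w2] / n
        scores[(w1, w2)] = math.log2(f / expected)
    return bigrams, scores

tokens = "the disk drive and the new disk drive on the desk".split()
bigrams, scores = scored_bigrams(tokens)
# Raw counts favor pairs containing frequent function words like "the";
# PMI promotes content pairs such as ("disk", "drive").
```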
Next Week

– Learn about linguistics!
– Decide on project participation