COMP 791A: Statistical Language Processing
Introduction
Chap. 1
1
Course information
- Prof: Leila Kosseim
- Office: LB 903-7
- Email: [email protected]
- Office hours: TBA
2
Goal of NLP
- Develop techniques and tools to build practical and robust systems that can communicate with users in one or more natural languages.

              Natural Lang.                 Artificial Lang.
  Lexical     >100,000 words                ~100 words
  Syntax      complex                       simple
  Semantics   1 word --> several meanings   1 word --> 1 meaning
3
References
- Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schütze, MIT Press, 1999.
- Speech and Language Processing, by Daniel Jurafsky & James H. Martin, Prentice Hall, 2000.
- Current literature available on the Web.
- See the course Web page: www.cs.concordia.ca/~kosseim/Teaching/COMP791-W04/
4
Other References
- Proceedings of major conferences:
  - ACL: Association for Computational Linguistics
  - EACL: European chapter of the ACL
  - ANLP: Applied NLP
  - COLING: Computational Linguistics
  - TREC: Text Retrieval Conference
5
Who studies languages?
- Linguist
  - What constrains the possible meanings of a sentence?
  - Uses: mathematical models (ex. formal grammars)
- Psycholinguist
  - How do people produce a discourse from an idea?
  - Uses: experimental observations with human subjects
- Philosopher
  - What is meaning anyway?
  - How do words identify objects in the world?
  - Uses: argumentation, examples and counter-examples
- Computational Linguist (NLP)
  - How can we identify the structure of sentences automatically?
  - Uses: data structures, algorithms, AI techniques (search, knowledge representation, machine learning, …)
6
Why study NLP?
- It is necessary for many useful applications:
  - information retrieval,
  - information extraction,
  - filtering,
  - spelling and grammar checking,
  - automatic text summarization,
  - understanding and generation of natural language,
  - machine translation…
7
Who needs NLP?
- Too many texts to manipulate:
  - on the Internet
  - e-mails
  - various corporate documentation
- Too many languages:
  - ~39,000 languages and dialects
8
Languages on the Internet
[chart: distribution of languages used on the Internet; source: Global Reach (www.glreach.com)]
9
[chart (con’t): distribution of languages used on the Internet; source: Global Reach (www.glreach.com)]
10
Applications of NLP
- Text-based: processing of written texts (ex. newspaper articles, e-mails, Web pages…)
  - Text understanding/analysis (NLU)
    - IR, IE, MT, …
  - Text generation (NLG)
- Dialog-based systems (human-machine communication)
  - Ex: QA, tutoring systems, …
11
Brief history of NLP
- 1940s - 1950s: Foundational Insights
  - Automata, finite-state machines & formal languages (Turing, Chomsky, Backus & Naur)
  - Probability and information theory (Shannon)
  - Noisy channel and decoding (Shannon)
- 1960s - 1970s: Two Camps
  - Symbolic: linguists & computer scientists
    - Transformational grammars (Chomsky, Harris)
    - Artificial Intelligence (Minsky, McCarthy)
    - Theorem proving, heuristics, general problem solver (Newell & Simon)
  - Stochastic: statisticians & electrical engineers
    - Bayesian reasoning for character recognition
    - Authorship attribution
    - Corpus work
12
Brief history of NLP (con’t)
- 1970s - 1980s: Four Paradigms
  - Stochastic approaches
  - Logic-based / rule-based approaches
  - Scripts and plans for NL understanding of “toy worlds”
  - Discourse modeling (discourse structures & coreference resolution)
- Late 1980s - 1990s: Rise of probabilistic models
  - Data-driven probabilistic approaches (more robust)
  - Engineering practical solutions using automatic learning
  - Strict evaluation of work
13
Why study NLP Statistically?
- Until about 10 years ago, NLP was mainly investigated using a rule-based approach.
- But:
  - Rules are often too strict to characterize people’s use of language (people tend to stretch and bend rules in order to meet their communicative needs).
  - Rules need (expert) people to develop them (knowledge acquisition bottleneck).
- Statistical methods are more flexible & more robust.
14
Tools and Resources Needed
- Probability/Statistical Theory:
  - statistical distributions, Bayesian decision theory
- Linguistics Knowledge:
  - morphology, syntax, semantics, pragmatics…
- Corpora:
  - bodies of marked or unmarked text
  - to which statistical methods and current linguistic knowledge can be applied
  - in order to discover novel linguistic theories or interesting and useful knowledge to build applications
15
The Alphabet Soup
- NLP: Natural Language Processing
- CL: Computational Linguistics
- NLE: Natural Language Engineering
- HLT: Human Language Technology
- IE: Information Extraction
- IR: Information Retrieval
- MT: Machine Translation
- QA: Question-Answering
- POS: Part-of-speech
- NLG: Natural Language Generation
- NLU: Natural Language Understanding
16
Why is NLP difficult?
- Because natural language is highly ambiguous.
- Syntactic ambiguity
  - "I made her duck." has 2 parses (i.e., syntactic analyses):

      (S (NP I)
         (VP (V made)
             (NP (PRO her)
                 (N duck))))

      (S (NP I)
         (VP (V made)
             (NP (PRO her))
             (VP (V duck))))

  - "The president spoke to the nation about the problem of drug use in the schools from one coast to the other." has 720 parses.
    Ex:
    - “to the other” can attach to any of the previous NPs (ex. “the problem”) or to the head verb --> 6 places
    - “from one coast” has 5 places to attach
    - …
17
Why is NLP difficult? (con’t)
- Word category ambiguity
  - book --> verb? or noun?
- Word sense ambiguity
  - bank --> financial institution? building? or river side?
- Words can mean more than the sum of their parts
  - make up a story
- Fictitious worlds
  - People on Mars can fly.
- Defining scope
  - People like ice-cream.
  - Does this mean that all (or some?) people like ice cream?
- Language is changing and evolving
  - I’ll email you my answer.
  - This new S.U.V. has a compartment for your mobile phone.
18
Methods that do not work well
- Hand-coded rules:
  - produce a knowledge acquisition bottleneck
  - perform poorly on naturally occurring text
- Ex: hand-coded syntactic constraints and preference rules
  - Ex: selectional restrictions
    - animate being --> swallow --> physical object
    - I swallowed his story / line.
    - The supernova swallowed the planet.
19
What Statistical NLP can do
- It seeks to solve the acquisition bottleneck:
  - by automatically learning preferences from corpora (ex. lexical or syntactic preferences).
- It offers a solution to the problem of ambiguity and "real" data, because statistical models:
  - are robust
  - generalize well
  - behave gracefully in the presence of errors and new data.
20
Some standard corpora





Brown corpus
 ~1 million words
 Tagged corpus (POS)
 Balanced (representative sample of American English in the
1960-1970) (different genres)
Lancaster-Oslo-Bergen (LOB) corpus
 British replication of the Brown corpus
Susanne corpus
 Free subset of Brown corpus (130 000 words)
 Syntactic structure
Penn Treebank
 Syntactic structure
 Articles from Wall Street Journal
Canadian Hansard
 Bilingual corpus of parallel texts
21
What to do with text corpora?
Count words
- Count words to find (see the counting sketch below):
  - What are the most common words in the text?
  - How many words are in the text?
    - word tokens vs. word types
  - What is the average frequency of each word in the text?
22
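The counting questions above take only a few lines of code. Below is a minimal sketch (my own illustration, not the course's code); the file name corpus.txt and the crude regex tokenization are assumptions:

```python
import re
from collections import Counter

# corpus.txt is an assumed local plain-text file
with open("corpus.txt", encoding="utf-8") as fh:
    tokens = re.findall(r"[a-z']+", fh.read().lower())   # crude word tokenization

counts = Counter(tokens)

print("word tokens:", len(tokens))              # total number of word occurrences
print("word types: ", len(counts))              # number of distinct words
print("most common:", counts.most_common(10))   # the most frequent words and their counts
# The average frequency of a word type is simply tokens / types:
print("average frequency (token/type ratio):", len(tokens) / len(counts))
```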
What’s a word anyways?
- I have a can opener; but I can’t open these cans. --> how many words?
- Word form
  - the inflected form as it appears in the text
  - can and cans --> different word forms
- Lemma
  - a set of lexical forms having the same stem, same POS and same meaning
  - can and cans --> same lemma
- Word token: an occurrence of a word
  - I have a can opener; but I can’t open these cans. --> 11 word tokens (not counting punctuation)
- Word type: a different realization of a word
  - I have a can opener; but I can’t open these cans. --> 10 word types (not counting punctuation; see the check below)
23
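As a quick check of the example sentence, the sketch below reproduces the 11-token / 10-type count. Note that the exact numbers depend on tokenization choices; dropping punctuation and keeping "can't" as a single token are my assumptions, not the slide's:

```python
import re

sentence = "I have a can opener; but I can't open these cans."
tokens = re.findall(r"[A-Za-z']+", sentence)   # drop punctuation, keep "can't" as one token

print(len(tokens), "word tokens")        # 11 tokens
print(len(set(tokens)), "word types")    # 10 types ("I" occurs twice)
```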
An example
- Mark Twain’s Tom Sawyer
  - 71,370 word tokens
  - 8,018 word types
  - token/type ratio = 8.9 (an indication of text complexity)
- The complete works of Shakespeare
  - 884,647 word tokens
  - 29,066 word types
  - token/type ratio = 30.4
24
Common words in Tom Sawyer
but words in NL have an uneven distribution…
25
Frequency of frequencies
- most words are rare
  - 3993 (50%) word types appear only once
  - they are called hapax legomena (read only once)
- but common words are very common
  - 100 words account for 51% of all tokens (of all text)
26
Word counts are interesting...
- as an indication of a text’s style
- as an indication of a text’s author
- but, because most words appear very infrequently, it is hard to predict much about the behavior of words (if they do not occur often in a corpus)
  --> Zipf’s Law
27
Zipf’s Law
1. Count the frequency of each word type in a large corpus
2. List the word types in order of their frequency
- Let:
  - f = frequency of a word type
  - r = its rank in the list
- Zipf’s Law says: f ∝ 1/r
- In other words:
  - there exists a constant k such that: f × r = k
  - the 50th most common word should occur with 3 times the frequency of the 150th most common word (see the check below)
28
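A minimal sketch of how one might check the law empirically (my own illustration; the file name tom_sawyer.txt and the tokenization are assumptions). If f ∝ 1/r, the product f × r should stay roughly constant across ranks, and the 50th word should be about three times as frequent as the 150th:

```python
import re
from collections import Counter

# tom_sawyer.txt: assumed local plain-text copy of the novel
with open("tom_sawyer.txt", encoding="utf-8") as fh:
    words = re.findall(r"[a-z']+", fh.read().lower())

ranked = Counter(words).most_common()    # [(word, freq), ...] from most to least frequent

# If f is proportional to 1/r, then f * r should be roughly constant (= k).
for r in (10, 50, 100, 500, 1000):
    word, f = ranked[r - 1]
    print(f"rank {r:5d}  {word:12s}  f = {f:6d}  f*r = {f * r}")

# The slide's claim: the 50th most common word should be about 3 times
# as frequent as the 150th most common word (since 150/50 = 3).
print(ranked[49][1] / ranked[149][1])
```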
Zipf’s Law on Tom Sawyer
- k ≈ 8000-9000
- except for:
  - the 3 most frequent words
  - words of frequency ≈ 100
29
Plot of Zipf’s Law
- On chap. 1-3 of Tom Sawyer: f × r = k (≠ numbers from p. 25 & 26)
- [plot: word frequency (y-axis) vs. rank (x-axis), ranks 0-2000]
30
Plot of Zipf’s Law (con’t)
- On chap. 1-3 of Tom Sawyer
- f × r = k  ==>  log(f × r) = log(k)  ==>  log(f) + log(r) = log(k)
- [plot: log(freq) vs. log(rank); Zipf’s Law appears as a roughly straight line (a plotting sketch follows below)]
31
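For completeness, a small plotting sketch (my own, assuming matplotlib is installed and a local file tom_sawyer_ch1-3.txt exists) that produces this kind of log-log rank/frequency plot:

```python
import re
from collections import Counter
import matplotlib.pyplot as plt

# tom_sawyer_ch1-3.txt: assumed local plain-text file with chapters 1-3
with open("tom_sawyer_ch1-3.txt", encoding="utf-8") as fh:
    words = re.findall(r"[a-z']+", fh.read().lower())

freqs = sorted(Counter(words).values(), reverse=True)   # frequency of rank 1, 2, 3, ...
ranks = range(1, len(freqs) + 1)

plt.loglog(ranks, freqs, marker=".", linestyle="none")  # log(freq) vs. log(rank)
plt.xlabel("log(rank)")
plt.ylabel("log(freq)")
plt.title("Zipf's Law")
plt.show()
```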
Zipf’s Law, so what?
- There are:
  - a few very common words
  - a medium number of medium-frequency words
  - a large number of infrequent words
- Principle of Least Effort: a tradeoff between the speaker’s and the hearer’s effort
  - the speaker communicates with a small vocabulary of common words (less effort)
  - the hearer disambiguates messages through a large vocabulary of rare words (less effort)
- Significance of Zipf’s Law for us:
  - for most words, our data about their use will be very sparse
  - only for a few words will we have a lot of examples
32
Another Zipf law on language
- The number of meanings of a word is correlated with its frequency:
  - the more frequent a word, the more senses it can have
  - m ∝ √f, or equivalently m ∝ 1/√r
    (f = frequency of the word, m = number of senses, r = rank of the word)
- Ex:
  - words at rank 2,000 have 4.6 meanings
  - words at rank 5,000 have 3 meanings
  - words at rank 10,000 have 2.1 meanings
  (see the check below)
- Ex: verb senses in WordNet:
  - serve has 13 senses
  - but most verbs have only 1 sense
33
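A quick arithmetic check of the m ∝ 1/√r reading against the slide's numbers (my own calculation, not from the slides): fit the constant from the rank-10,000 point and see how well the other two points are predicted:

```python
from math import sqrt

c = 2.1 * sqrt(10_000)        # fit the constant from the rank-10,000 point: c = 210
for rank, observed in [(2_000, 4.6), (5_000, 3.0), (10_000, 2.1)]:
    predicted = c / sqrt(rank)
    print(f"rank {rank:6d}: predicted {predicted:.2f} meanings, slide says {observed}")
# rank   2000: predicted 4.70 meanings, slide says 4.6
# rank   5000: predicted 2.97 meanings, slide says 3.0
# rank  10000: predicted 2.10 meanings, slide says 2.1
```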
Yet another Zipf law on language
- Content words tend to "clump" together:
  - if we take a text and count the distance between identical words (tokens),
  - then the frequency of intervals of size s between identical tokens is inversely proportional to the size s:
      f ∝ 1/s^p
    (f = frequency of intervals of size s, s = size of the interval, p varies between 1 and 1.3)
  - i.e. we have a large number of small intervals
  - i.e. we have a small number of large intervals
  --> most content words occur near each other (see the sketch below)
34
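A minimal sketch of how one might measure these intervals (my own illustration; corpus.txt is an assumed input file):

```python
import re
from collections import Counter

# corpus.txt is an assumed local plain-text file
with open("corpus.txt", encoding="utf-8") as fh:
    tokens = re.findall(r"[a-z']+", fh.read().lower())

last_seen = {}               # word -> position of its previous occurrence
interval_counts = Counter()  # interval size s -> how many times it occurs
for position, word in enumerate(tokens):
    if word in last_seen:
        interval_counts[position - last_seen[word]] += 1
    last_seen[word] = position

# Small intervals should be far more frequent than large ones (f ∝ 1/s^p).
for s in (1, 2, 5, 10, 50, 100, 500):
    print(f"interval size {s:4d}: {interval_counts[s]} occurrences")
```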
What to do with text corpora?
Find Collocations
- Collocation: a phrase where the whole expression is perceived as having an existence beyond the sum of its parts
  - disk drive, make up, bacon and eggs…
- important for machine translation
  - strong tea --> thé fort
  - strong argument --> ?argument fort (convaincant)
- can be extracted from a text (see the sketch below)
  - find the most common bigrams
  - however, since these bigrams are often insignificant (ex. “at the”, “of a”), they can be filtered
35
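A minimal sketch of bigram counting and stop-word filtering (my own illustration; the stop-word list and the file corpus.txt are assumptions, not part of the course material):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "at", "on", "for", "is", "that"}

# corpus.txt is an assumed local plain-text file
with open("corpus.txt", encoding="utf-8") as fh:
    tokens = re.findall(r"[a-z']+", fh.read().lower())

bigrams = Counter(zip(tokens, tokens[1:]))

# Raw bigrams: dominated by insignificant pairs such as ("of", "the"), ("at", "the"), ...
print(bigrams.most_common(10))

# Filtered bigrams: keep only pairs in which neither word is a stop word.
filtered = Counter({bg: n for bg, n in bigrams.items()
                    if bg[0] not in STOP_WORDS and bg[1] not in STOP_WORDS})
print(filtered.most_common(10))
```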
Collocations
[tables: the most frequent raw bigrams vs. the most frequent filtered bigrams]
36
What to do with text corpora?
Concordances
- Find the different contexts in which a word occurs.
- Key Word In Context (KWIC) concordancing program (see the sketch below).
37
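A minimal KWIC sketch (my own illustration; corpus.txt, the window size, and the chosen key word are all assumptions):

```python
import re

def kwic(tokens, keyword, window=4):
    """Print every occurrence of `keyword` with `window` words of context on each side."""
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>35}  [{tok}]  {right}")

# corpus.txt is an assumed local plain-text file
with open("corpus.txt", encoding="utf-8") as fh:
    tokens = re.findall(r"[a-z']+", fh.read().lower())

kwic(tokens, "showed")   # e.g., inspect which syntactic frames a verb appears in
```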
Concordances
- useful for:
  - finding syntactic frames of verbs
    - transitive? intransitive?
  - building dictionaries for learners of foreign languages
  - guiding statistical parsers
38