Download Introduction to Computational Linguistics Context Free Grammars

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Zulu grammar wikipedia , lookup

Serbo-Croatian grammar wikipedia , lookup

Preposition and postposition wikipedia , lookup

Portuguese grammar wikipedia , lookup

Arabic grammar wikipedia , lookup

Compound (linguistics) wikipedia , lookup

Chinese grammar wikipedia , lookup

Dependency grammar wikipedia , lookup

Lexical semantics wikipedia , lookup

Probabilistic context-free grammar wikipedia , lookup

Ancient Greek grammar wikipedia , lookup

Scottish Gaelic grammar wikipedia , lookup

Pleonasm wikipedia , lookup

Context-free grammar wikipedia , lookup

Malay grammar wikipedia , lookup

Latin syntax wikipedia , lookup

Yiddish grammar wikipedia , lookup

Spanish grammar wikipedia , lookup

Polish grammar wikipedia , lookup

Esperanto grammar wikipedia , lookup

Vietnamese grammar wikipedia , lookup

Determiner phrase wikipedia , lookup

Parsing wikipedia , lookup

English grammar wikipedia , lookup

Transformational grammar wikipedia , lookup

Pipil grammar wikipedia , lookup

Junction Grammar wikipedia , lookup

Transcript
Introduction to Computational
Linguistics
Context Free Grammars
Slide 1
[Revised Version]
Week 5, Lectures 1 & 2
November 7, 2002
Today
Phrase Structure and Context Free Grammars
– What is a CFG?
– Linguistic examples of CFGs
Slide 2
Constituents
– Linguistic problem areas for CFGs
– Grammar and spoken language
Part of Speech and Tagging
Main points...
Parts of speech represent classes of words based on morphological and
distributional similarities.
Accurate part of speech tagging is pretty much a solved problem (at least for
Slide 3
English) by any one of a number of statistical or rule-based approaches.
For situations where there is no tagged training data one can fall back on
unsupervised HMM training, but results are still poor.
Syntax
By SYNTAX, we mean various aspects of how words are strung together to form
components of sentences and how those components are strung together to form
sentences.
New concept: C ONSTITUENCY
Slide 4
groups of words may behave as a single unit or CONSTITUENT, e.g., noun
phrase
evidence
– the whole group appears in similar syntactic environments, e.g., before a
verb
– preposed/postposed constructions
Syntax in CL
Syntactic analysis is used to varying degrees in applications such as:
Grammar checkers
Spoken Language Understanding
Question answering systems
Slide 5
Information extraction
Automatic Text Generation
Machine Translation
Typically, detailed syntactic analysis is taken to be a prerequisite for detailed
semantic interpretation.
Context Free Grammars (CFGs)
These:
capture constituency and ordering;
formalise descriptive linguistic work of the ’40s and ’50s;
are widely used in linguistics.
Slide 6
CFGs don’t directly encode grammatical relations (such as SUBJECT of a verb) or
dependency relations (such as HEAD of a clause); however, to a certain extent, they
can be defined in terms of constituency.
CFGs are somewhat oriented towards languages like English which have relatively
fixed word order.
Most modern linguistic theories of grammar incorporate some notions from context
free grammar, to a lesser or greater extent.
Context Free Grammars
A CFG is a 4-tuple, and defines a formal language
a set of non-terminal symbols
a set of terminal symbols
a set of productions
Slide 7
–
–
–
:
is a non-terminal
is a string of symbols from the infinite set of strings
a designated start symbol
CFGs are equivalent to Backus-Naur form (BNF) grammars.
Example
S
NP VP
NP
N om
VP
Slide 8
D et
N
V
NP
D et N om
N
V
a
flight
left
= ‘noun phrase’, VP = ‘verb phrase’, Det = ‘determiner’, Nom = ‘Nominal’, N =
‘noun’, V = ‘verb’.
Generativity
As with FSAs, we can view a CFG as a device for (i) assigning structure to a given
sentence, or (ii) generating sentences
Sentences that can be DERIVED by a grammar belong to the formal language
defined by the grammar, and are called G RAMMATICAL S ENTENCES
Slide 9
Those sentences that cannot be derived are U NGRAMMATICAL S ENTENCES
The formal language,
, defined by a grammar
is the set of strings composed
of terminal symbols that are derivable from the START SYMBOL:
and
derives
PARSING is mapping from a word string to its parse tree
Derivations and Trees
A DERIVATION of a string is just a sequence of rule applications
Derivations can be visualized as PARSE TREES.
NP
Slide 10
Det
Nom
a
N
flight
We say that a flight can be DERIVED from the non-terminal NP.
Equivalence and Normal Form
S TRONG E QUIVALENCE
two grammars generate the same set of strings and assign the same phrase
structures
W EAK E QUIVALENCE
Slide 11
two grammars generate the same set of strings but do not assign the same
phrase structures
C HOMSKY N ORMAL F ORM (CNF)
productions of the form
or
upper case indicates non-terminals, lowercase indicates terminals
Equivalence and Normal Form: Example
Original
CNF
Slide 12
What type of equivalence?
What’s this ‘Context’?
The use of the term CONTEXT FREE in the description of this formalism has nothing
to do with the ordinary use of the word “context”.
All it really means is that there is a single non-terminal on the left-hand side of the
rule, e.g.,
Slide 13
In other words, we can rewrite an
in which we find the
.
as
, regardless of the (string/tree) context
CFG: As opposed to what?
Regular Grammars — generally claimed to be too weak to capture lingistic
generalizations.
Context Sensitive grammars — generally regarded as too strong.
Recursively Enumerable (Type 0) Grammars — generally regarded as way too
Slide 14
strong. (But it has sometimes be suggested that NLs aren’t even RE!)
RE vs. Recursive — we seem to know when strings are ungrammatical.
Approaches that are TOO STRONG have the power to predict/describe/capture
syntactic structures that don’t exist in human languages. However, it is generally
agreed that CFGs aren’t quite strong enough.
The computational processes associated with stronger formalisms are not as
efficient as those associated with weaker methods.
The Chomsky Hierarchy
L ANGUAGE T YPE
AUTOMATON
G RAMMARS
E XEMPLAR
Type 0:
Recursively Enumerable
Universal Turing Machine
Linear Bounded Automaton
Type 1:
Context sensitive
Slide 15
Type 2:
Context Free
Push-Down Automaton
Finite-state Automaton
Type 3:
Finite State
where
etc. are variables over nonterminal symbols,
terminal symbols or words,
etc. are variables over
etc. are variables over strings of terminal and/or
non-terminal symbols, and , , etc. are terminal symbols.
Developing Grammars
A huge amount of skilled effort goes into the development of grammars for human
languages — can only scratch the surface here.
Our primary question: What constitutes a constituent?
Answer: A consistent, recurring, syntactic substructure.
Slide 16
A Tiny Lexicon
is prefer
V
A dj
Slide 17
Pro
Proper-Noun
Conj
cheapest
other
like need
want
fly
latest
...
you it
non-stop first
direct
me I
...
Alaska Baltimore Los Angeles
Chicago United
D et
P
flights breeze trip morning . . .
N
American . . .
from to on near and or but . . .
the a an this these that
...
...
A Tiny Grammar
S
NP
Nominal
Slide 18
VP
PP
NP VP
I + want a morning flight
Pro
I
Proper-Noun
Los Angeles
Det Nom
a + flight
N Noml
morning + flight
Noun
flights
V
do
V NP
want + a flight
Verb NP PP
leave + Boston + in the morning
V PP
leaving + on Thursday
P NP
from + Los Angeles
Example Phrase Structure Tree
S
NP
Slide 19
VP
Pro
V
I
prefer
NP
Det
a
Nom
Noun
Noun
morning
flight
Key Constituents
The key constituents that we will look at are the following:
Sentences
Noun phrases
Verb phrases
Slide 20
Prepositional phrases
Common Sentence Types
Declaratives
John left.
S
NP VP
Imperatives
Slide 21
Leave!
S
VP
Yes-No Questions
Did John leave?
Aux NP VP
S
WH Questions (who, what, where, which, why, how )
When did John leave?
S
Wh-XP Aux NP VP
Noun Phrases (simplified)
Revolves around a HEAD (the central noun)
P RENOMINAL modifiers come before the head noun:
NP
Slide 22
(Det) (Card) (Ord) (Quant) (AP) Nom
NB. Placing parentheses around a category on the rhs of a rule indicates that the
constituent in question is OPTIONAL.
Determiners (Det)
a flight
Cardinal Numbers (Card)
one stop
Noun Phrases, cont.
Ordinal Numbers (Ord)
the first stop
Quantifiers (Quant)
Slide 23
many fares
Adjective Phrases (AP)
a fast flight
Noun Phrases (continued)
P OSTNOMINAL modifiers come after the head noun
N om
N om
N om
Slide 24
N om PP
N om VP
N om S [+rel]
Prepositional Phrase (PP):
the flights from San Francisco to Boston
Nonfinite VPs:
VP[+ing]: flights leaving on Thursday
VP[+to]: the last flight to arrive
VP[+ed]: the aircraft used by USAir
Relative Clauses (S[+rel])
S [+rel]
NP [+rel] VP
flights that serve lunch
S [+rel]
NP [+rel] S [+gap]
flight which you booked
Postnominal modifiers can be combined
Slide 25
a flight [from San Francisco] [leaving Monday]
a flight [which I’d booked] [which never departed]
Recursive Structures
There is no upper bound on the length of a grammatical English sentence.
– Therefore the set of English sentences is infinite.
A grammar is a finite statement about well-formedness.
– To account for an infinite set, it has to allow iteration (e.g.,
Slide 26
) or recursion.
R ECURSIVE RULES: where the non-terminal on the left-hand side of the arrow in
a rule also appears on the right-hand side of a rule.
Recursive Structures, cont.
Direct recursion:
N om
VP
N om PP
flight to Boston
departed Miami at noon
VP PP
Indirect recursion:
Slide 27
S
VP
NP VP
said that the flight was late
V S
S
NP
Slide 28
VP
Pro
V
they
said
S
NP
VP
Pro
V
he
claimed
S
NP
VP
Pro
V
she
lied
Is English a Regular Language?
Chomsky (1957):
1.
2. If
, then/*or
) is not a regular language.
.
. [either or ], then
. ( ) 5. If [either [if , then ], or ], then . ( )
4. If
3. Either
Slide 29
(e.g.,
, *then/or
“Note that many of the sentences of the form [(4)], etc., will be quite strange
and unusual . . . . But they are all grammatical sentences, formed by
processes of sentence construction so simple and elementary that even the
most rudimentary grammar of English would contain them.” Chomsky
(1957, p.23)
Is English a Regular Language, cont
(1)
a. If Kim comes, then Lee will come.
b. If either Rusty comes or Sandy stays at home, then Lee will come.
c. If either if Kim comes then she’ll bring Sandy or Sandy stays at home,
then Lee will come.
Slide 30
(2)
a. The mouse died.
b. The mouse [that the cat chased] died.
c. The mouse [that the cat [that the dog chased] bit] died.
d. The mouse [that the cat [that the dog [that the child fed] chased] bit]
died.
Depth restriction seems quite robust (
)
Is the performance/competence distinction appropriate?
Coordination
NP
VP
S
Slide 31
S
NP
and NP
VP
and VP
and S
(3)
I need [[NP the times] and [NP the fares]].
(4)
a flight [VP departing at 9a.m.] and [VP returning at 5p.m.]]
(5)
[[S I depart on Wednesday] and [S I’ll return on Friday]].
In fact, any phrasal constituent can be conjoined with a constituent of the same type
to form a new constituent of that type. So, coordination in English is better
expressed by a schema of the following sort (where XP is an arbitrary phrase):
XP
XP
and XP
Syntactic Ambiguity
Many kinds of syntactic (structural) ambiguity.
PP
attachment has received much attention:
VP
Slide 32
VP
V
NP
V
saw
Nom
saw
Nom
PP
the man
with a telescope
NP
PP
the man
with a telescope
PP Ambiguity
Different structures naturally correspond to different semantic interpretations
(‘readings’)
Arises from independently motivated syntactic rules:
VP
Slide 33
NP
V
. . . PP
N om PP
However, also strong, lexically influenced, preferences:
– I bought [a book [on linguistics]]
– I bought [a book] [on sunday]
Problem Areas for CFGs
Agreement
Subcategorization
Movement
Slide 34
Number Agreement
In English, some determiners agree in number with the head noun:
Slide 35
(6)
This dog
(7)
Those dogs
(8)
* Those dog
(9)
* This dogs
And verbs agree in number with their subjects:
(10)
What flights leave in the morning?
(11)
* What flight leave in the morning?
Number Agreement, cont.
Expand our grammar with multiple sets of rules?
NP sg
NP pl
S sg
Slide 36
S pl
D etsg N sg
D etpl N pl
NP sg VP sg
VP sg
VP pl
NP pl VP pl
V sg ( NP ) ( NP ) ( PP )
V pl ( NP ) ( NP ) ( PP )
worse when we add person and even worse in languages with richer agreement
(e.g., three genders).
lose generalizations about nouns and verbs — can’t say property
words of category V.
is true of all
Subcategorization
Verbs have preferences for the kinds of constituents they co-occur with.
(12)
I found the cat
(13)
Slide 37
* I disappeared the cat.
A traditional subcategorization of verbs:
transitive (takes a direct object NP)
intransitive
In more recent approaches, there might be as many as a hundred
subcategorizations of verb.
Subcategorization, cont.
More examples:
find is subcategorized for an NP (can take an NP complement)
want is subcategorized for an NP or an infinitival VP
bet is subcategorized for NP NP S
A listing of the possible sequences of complements is called the
Slide 38
SUBCATEGORIZATION FRAME
for the verb.
As with agreement, the obvious CFG solution yields rule explosion:
VP
V ditr NP NP
VP
V intr
VP
VP
V tr NP
V ditrS NP NP S
Example Subcategorization Frames
Frame
Verb
Example
eat, sleep
I want to eat
NP
prefer, find, leave,
Find [NP the flight from Pittsburgh to Boston]
NP NP
show, give
Show [NP me] [NP airlines with flights from
Pittsburgh]
Slide 39
PP PP
fly, travel
I would like to fly [PP from Boston] [PP to
Philadelphia]
NP PP
help, load,
Can you help [NP me] [PP with a flight]
VP [+inf]
prefer, want, need
I would prefer [VP [+inf] to go by United airlines]
VP [+base]
can, would, might
I can [VP [+base] go from Boston]
S
mean
Does this mean [S AA has a hub in Boston]?
Movement (or Unbounded Dependency)
Constructions
(14)
a.* I gave
to the driver.
b. I gave some money to the driver.
(15)
Slide 40
a. $5 [I gave
to the driver], (and $1 I gave to the porter).
b. He asked how much [I gave
to the driver].
c. I forgot about the money which [I gave
to the driver].
(16)
How much did you think [I gave
to the driver]?
(17)
How much did you think he claimed [I gave
(18)
How much did you think he claimed that I said [I gave
(19)
...
to the driver]?
to the driver]?
Spoken Language
Slide 41
the . [exhale] . . . [inhale] . . [uh] does American airlines . offer any
. one way flights . [uh] one way fares, for one hundred and sixty one
dollars
[mm] i’d like to leave i guess between [um] . [smack] . five o’clock no,
five o’clock and [uh], seven o’clock . P M
around, four, P M
all right, [throat clear] . . i’d like to know the . give me the flight . times
. in the morning . for September twentieth . nineteen ninety one
[uh] one way
[uh] seven fifteen, please
on United airlines . . give me, the . . time . . from New York . [smack]
. to Boise-, to . I’m sorry . on United airlines . [uh] give me the flight,
numbers, the flight times from . [uh] Boston . to Dallas
Grammars for Spoken Language
Spoken language is similar to, yet different from, written language.
Differences include:
lexical statistics (e.g., spoken language has more pronouns)
constructional simplicity (e.g., less embedding)
Slide 42
disfluencies (e.g., uh, um, word repetitions, false starts)
sentence fragments (e.g., around 4 p.m. as an answer)
Disfluencies in Spoken Language
Interruption Point
Does American airlines offer any one−way flights
Reparandum
[uh]
Editing Phase
one−way fares for 160 dollars?
Repair
false start (reparandum)
Slide 43
uh
repair
Reading
J&K Chapter 9
Slide 44
Acknowledgements
Some content in these slides is drawn from course materials created by James
Martin and by Steven Bird.
Slide 45