Statistical learning of grammars
Mark Johnson
Brown University
BUCLD 2005

Outline
Introduction
Stochastic grammars
Supervised learning
Unsupervised learning
Learning a real language
Conclusion
Statistical learning and implicit negative evidence

- Logical approach to acquisition
  - no negative evidence
    ⇒ subset problem: guess L2 when the true language is L1
  [Diagram: language L1 drawn as a subset of language L2]
- Statistical approach to learning
  - stronger input assumptions (the input follows a distribution)
  - weaker success criteria (probabilistic success)
  - if L2 − L1 is expected to occur but doesn't
    ⇒ L2 is probably wrong: implicit negative evidence
  - succeeds where logical learning fails (e.g., PCFGs)
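The statistical argument can be made concrete with a toy calculation. In the Python sketch below, the 0.1 occurrence probability and the sample sizes are invented for illustration: the overgeneral grammar L2 predicts that utterances in L2 − L1 occur with probability 0.1, so their persistent absence rapidly makes L2 implausible.

    # Toy illustration of implicit negative evidence. Suppose the overly
    # general grammar L2 says an utterance falls in L2 - L1 with
    # probability 0.1 (an invented number). Then never observing such an
    # utterance in n samples is strong evidence against L2.
    p_extra = 0.1
    for n in (10, 50, 100):
        # P(no L2-only utterance in n samples | L2 is correct)
        print(n, (1 - p_extra) ** n)
    # -> roughly 0.35, 0.0052, 0.000027: L2 becomes implausible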
What is statistical learning?

- Statistical learners learn from statistical distributional properties of the input
  - not just whether something occurs (logical learning), but how often
  - assumes the input follows some (unknown) probability distribution
  - much more than transitional probabilities!
- Both logic and statistics are kinds of inference
  - statistical inference uses more information from the input
- Statistical learning (a.k.a. machine learning) is a separate field
  - mathematical theories relating the learning goal to statistics
  - the most informative statistic depends on:
    - what the learner is trying to learn
    - the current state of the learner

Vapnik (1998) Statistical Learning Theory
What are the right units of generalization?

1. Colorless green ideas sleep furiously.
2. *Furiously sleep ideas green colorless.

- Both sentences have zero frequency
  ⇒ frequency ≠ well-formedness
- A hidden class bigram model still distinguishes them:
  P(colorless green ideas sleep furiously)
    = P(colorless) P(green | colorless) ...
    = 2 × 10^5 × P(furiously sleep ideas green colorless)

Chomsky (1957) Syntactic Structures
Pereira (2000) "Formal grammar and information theory: Together again?"

Units of generalization in learning

- Grammars are tools for investigating different units of generalization.
- Grammars can model a wide variety of phenomena:
  - various types of grammatical dependencies
  - word segmentation (Brent)
  - syllable structure (Goldwater and Johnson)
  - morphological dependencies (Goldsmith)

[Tree diagrams: a phrase-structure analysis and a dependency analysis of "Sam promised to write an article", with features +control and +transitive marking the verbs]
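The hidden class bigram model above can be sketched in a few lines of Python. Everything below (the classes ADJ/N/V/ADV and all the probabilities) is invented for illustration, so it reproduces only the qualitative contrast between the two word orders, not the 2 × 10^5 ratio reported on the slide.

    # Class-based ("hidden class") bigram model in the spirit of Pereira
    # (2000): P(w_i | w_{i-1}) = sum_c P(c | w_{i-1}) * P(w_i | c).

    def sentence_prob(words, p_c_given_w, p_w_given_c):
        prob, prev = 1.0, "<s>"
        for w in words:
            prob *= sum(p_c * p_w_given_c[c].get(w, 0.0)
                        for c, p_c in p_c_given_w[prev].items())
            prev = w
        return prob

    # Invented soft class memberships for each conditioning word ...
    p_c_given_w = {
        "<s>":       {"ADJ": 0.4, "N": 0.3, "V": 0.2, "ADV": 0.1},
        "colorless": {"ADJ": 1.0},
        "green":     {"ADJ": 0.7, "N": 0.3},
        "ideas":     {"N": 1.0},
        "sleep":     {"V": 0.9, "N": 0.1},
        "furiously": {"ADV": 1.0},
    }
    # ... and invented emission probabilities P(word | class).
    p_w_given_c = {
        "ADJ": {"colorless": 0.3, "green": 0.5, "ideas": 0.1, "sleep": 0.05, "furiously": 0.05},
        "N":   {"ideas": 0.6, "green": 0.2, "sleep": 0.1, "colorless": 0.05, "furiously": 0.05},
        "V":   {"sleep": 0.7, "ideas": 0.1, "green": 0.1, "colorless": 0.05, "furiously": 0.05},
        "ADV": {"furiously": 0.8, "sleep": 0.05, "green": 0.05, "colorless": 0.05, "ideas": 0.05},
    }

    s1 = "colorless green ideas sleep furiously".split()
    s2 = "furiously sleep ideas green colorless".split()
    print(sentence_prob(s1, p_c_given_w, p_w_given_c))  # larger ...
    print(sentence_prob(s2, p_c_given_w, p_w_given_c))  # ... than this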
Outline
Introduction
Stochastic grammars
Supervised learning
Unsupervised learning
Learning a real language
Conclusion

Why grammars?

1. Useful for both production and comprehension.
2. Compositional representations seem necessary for semantic interpretation.
3. Curse of dimensionality: the number of possibly related entities grows exponentially
   - 1,000 words = 1,000 unigrams, 1,000,000 bigrams, 1,000,000,000 trigrams, ... (sparse data)
   - grammars identify relationships to generalize over
   - sparse-data problems are more severe with larger, more specialized representations
4. "Glass-box" models (you can see inside): the learner's assumptions and conclusions are explicit.
There are stochastic variants of most grammars

- The grammar generates candidate structures (e.g., strings of words, trees, OT candidates, construction grammar analyses, minimalist derivations, ...).
- Associate numerical weights with configurations that occur in these structures:
  - pairs of adjacent words
  - rules used to derive the structure
  - constructions occurring in the structure
  - P&P parameters (e.g., HeadFinal)
- Combine (e.g., multiply) the weights of the configurations occurring in a structure to get its score.

Abney (1997) "Stochastic Attribute-Value Grammars"
Probabilistic Context-Free Grammars

- The probability of a tree is the product of the probabilities of the rules used to construct it.

  1.0  S → NP VP       1.0  VP → V
  0.75 NP → George     0.25 NP → Al
  0.6  V → barks       0.4  V → snores

  P([S [NP George] [VP [V barks]]]) = 1.0 × 0.75 × 1.0 × 0.6 = 0.45
  P([S [NP Al] [VP [V snores]]])    = 1.0 × 0.25 × 1.0 × 0.4 = 0.1
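A minimal sketch of this calculation in Python, using the toy grammar from the slide; the nested-tuple tree encoding is just one convenient choice:

    # Rule probabilities, keyed by (parent, tuple of children).
    rules = {
        ("S",  ("NP", "VP")): 1.0,
        ("NP", ("George",)):  0.75,
        ("NP", ("Al",)):      0.25,
        ("VP", ("V",)):       1.0,
        ("V",  ("barks",)):   0.6,
        ("V",  ("snores",)):  0.4,
    }

    def tree_prob(tree):
        """P(tree) = product of the probabilities of the rules it uses.
        A tree is (label, child, ...); a bare string is a word."""
        if isinstance(tree, str):   # words contribute no extra probability
            return 1.0
        label, children = tree[0], tree[1:]
        kids = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = rules[(label, kids)]
        for c in children:
            p *= tree_prob(c)
        return p

    t = ("S", ("NP", "George"), ("VP", ("V", "barks")))
    print(tree_prob(t))   # 1.0 * 0.75 * 1.0 * 0.6 = 0.45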
Learning as optimization

- Pick a task that the correct grammar should be able to do well:
  - predicting sentences and their structures (supervised learning)
  - predicting the (next) words in sentences (unsupervised learning)
- Find the weights that optimize performance on the task.
- Searching for optimal weights is usually easier than searching for optimal categorical grammars.

Rumelhart and McClelland (1986) Parallel Distributed Processing
Tesar and Smolensky (2000) Learnability in Optimality Theory
Outline
Introduction
Stochastic grammars
Supervised learning
Unsupervised learning
Learning a real language
Conclusion
Learning PCFGs from trees (supervised)

- The grammar determines the units of generalization.

Training trees:
  [S [NP rice] [VP grows]]
  [S [NP rice] [VP grows]]
  [S [NP corn] [VP grows]]

  Rule        Count   Rel Freq
  S → NP VP   3       1
  NP → rice   2       2/3
  NP → corn   1       1/3
  VP → grows  3       1

  P([S [NP rice] [VP grows]]) = 2/3
  P([S [NP corn] [VP grows]]) = 1/3

- Relative frequency is the maximum likelihood estimator: it selects the rule probabilities that maximize the probability of the trees.
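The relative-frequency estimator is a few lines of Python. The sketch below uses the three training trees from the slide and the same nested-tuple tree encoding as the earlier PCFG sketch:

    from collections import Counter, defaultdict

    def productions(tree):
        """Yield the (parent, children) rules used in a tree."""
        if isinstance(tree, str):
            return
        label, children = tree[0], tree[1:]
        yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
        for c in children:
            yield from productions(c)

    treebank = [
        ("S", ("NP", "rice"), ("VP", "grows")),
        ("S", ("NP", "rice"), ("VP", "grows")),
        ("S", ("NP", "corn"), ("VP", "grows")),
    ]

    rule_counts = Counter(r for t in treebank for r in productions(t))
    parent_counts = defaultdict(int)
    for (parent, _), n in rule_counts.items():
        parent_counts[parent] += n

    # P(rule) = count(rule) / count(parent): NP -> rice gets 2/3, etc.
    probs = {rule: n / parent_counts[rule[0]] for rule, n in rule_counts.items()}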
Grammars and generalizations

- The grammar determines the units of generalization.
- Training data: 50%: N, 30%: N PP, 20%: N PP PP
- With flat rules NP → N, NP → N PP, NP → N PP PP, the predicted probabilities replicate the training data:
  50%: [NP N]   30%: [NP N PP]   20%: [NP N PP PP]
- But with adjunction rules NP → N, NP → NP PP, the predictions form a geometric distribution instead (each PP is adjoined, as in [NP [NP N] PP]); see the sketch after the next slide:
  58%: N   24%: N PP   10%: N PP PP   5%: N PP PP PP

Finding the best units of generalization

- Predicate and argument structure in Lexicalized Tree-Adjoining Grammar
- Head-argument dependencies in Dependency Grammar
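The adjunction-grammar percentages can be checked with a short calculation; this is a sketch assuming maximum likelihood estimation from the training distribution, which matches the slide's 58%/24%/10%/5% up to rounding:

    # Training distribution over the number of PPs under an NP.
    training = {0: 0.5, 1: 0.3, 2: 0.2}

    # A tree with k PPs uses NP -> NP PP k times and NP -> N once, so the
    # MLE for NP -> NP PP is (expected adjunctions) / (expected expansions).
    adjunctions = sum(k * p for k, p in training.items())   # 0.7
    q = adjunctions / (adjunctions + 1.0)                   # ~0.41

    # Predicted distribution is geometric: P(k PPs) = (1 - q) * q**k.
    for k in range(4):
        print(k, round((1 - q) * q ** k, 2))   # 0.59, 0.24, 0.1, 0.04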
Learning from words alone (unsupervised)

- The training data consists of strings of words w.
- Optimize the grammar's ability to predict w: find the grammar that makes w as likely as possible.
- Expectation maximization (EM) is an iterative procedure for building unsupervised learners out of supervised learners:
  - parse a bunch of sentences with the current guess at the grammar
  - weight each parse tree by its probability under the current grammar
  - estimate the grammar from these weighted parse trees as before
- Each iteration is guaranteed not to decrease P(w) (but the search can get trapped in local maxima).

Dempster, Laird and Rubin (1977) "Maximum likelihood from incomplete data via the EM algorithm"
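Here is a minimal sketch of one EM iteration in Python, under the simplifying assumption that each sentence's candidate parses can be enumerated explicitly (each parse given as the list of rules it uses); real implementations compute the same expected counts efficiently with the inside-outside algorithm.

    from collections import defaultdict

    def em_step(sentence_parses, probs):
        """sentence_parses: one list of candidate parses per sentence, each
        parse a list of (parent, children) rules. probs: current rule
        probabilities. Returns the re-estimated rule probabilities."""
        expected = defaultdict(float)
        for parses in sentence_parses:
            # E-step: score each parse under the current grammar ...
            scores = []
            for rules in parses:
                p = 1.0
                for r in rules:
                    p *= probs[r]
                scores.append(p)
            z = sum(scores)   # P(sentence) under the current grammar
            # ... and weight its rules by the parse's posterior probability.
            for rules, p in zip(parses, scores):
                for r in rules:
                    expected[r] += p / z
        # M-step: relative frequency on the expected (fractional) counts.
        parent_total = defaultdict(float)
        for (parent, _), n in expected.items():
            parent_total[parent] += n
        return {r: n / parent_total[r[0]] for r, n in expected.items()}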
Outline
Introduction
Stochastic grammars
Supervised learning
Unsupervised learning
Learning a real language
Conclusion
Expectation Maximization with a toy grammar

Initial rule probabilities:

  rule            prob
  ···             ···
  VP → V          0.2
  VP → V NP       0.2
  VP → NP V       0.2
  VP → V NP NP    0.2
  VP → NP NP V    0.2
  ···             ···
  Det → the       0.1
  N → the         0.1
  V → the         0.1

"English" input:
  the dog bites
  the dog bites a man
  a man gives the dog a bone
  ···

"pseudo-Japanese" input:
  the dog bites
  the dog a man bites
  a man the dog a bone gives
  ···

[Plot: "Probability of English" — geometric average sentence probability vs. EM iteration (0-5), on a log scale from 10^-6 to 1]
[Plot: "Probability of Japanese" — geometric average sentence probability vs. EM iteration (0-5), on a log scale from 10^-6 to 1]

[Plot: "Rule probabilities from English" — probabilities of VP → V NP, VP → NP V, VP → V NP NP, VP → NP NP V, Det → the, N → the, V → the vs. EM iteration (0-5), on a scale from 0 to 1]
[Plot: "Rule probabilities from Japanese" — probabilities of VP → V NP, VP → NP V, VP → V NP NP, VP → NP NP V, Det → the, N → the, V → the vs. EM iteration (0-5), on a scale from 0 to 1]

Statistical grammar learning

- Simple algorithm: learn from your best guesses
  - requires the learner to parse the input
- "Glass box" models: the learner's prior knowledge and learnt generalizations are explicitly represented.
- Optimization of a smooth function of the rule weights ⇒ learning can involve small, incremental updates.
- Learning structure (rules) is hard, but parameter estimation can approximate rule learning (see the sketch below):
  - start with a "superset" grammar
  - estimate rule probabilities
  - discard low-probability rules
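The pruning step of this recipe might look as follows; the 0.01 threshold is an arbitrary illustrative choice.

    def prune(probs, threshold=0.01):
        """Discard rules whose probability fell below the threshold and
        renormalize each parent's surviving rules."""
        kept = {r: p for r, p in probs.items() if p >= threshold}
        totals = {}
        for (parent, _), p in kept.items():
            totals[parent] = totals.get(parent, 0.0) + p
        return {r: p / totals[r[0]] for r, p in kept.items()}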
Outline
Introduction
Stochastic grammars
Supervised learning
Unsupervised learning
Learning a real language
Conclusion

The importance of starting small

- EM works by learning from its own parses
  - each parse is weighted by its probability
  - rules used in high-probability parses receive strong reinforcement
- In grammar-based models, ambiguity grows with sentence length
  - longer sentences are typically highly ambiguous
  ⇒ lower average parse probability
  ⇒ less clear information about which rules are most useful
Training from real language

- The ATIS treebank consists of 1,300 hand-constructed parse trees.
- About 1,000 PCFG rules are needed to build these trees.
- The words are ignored (in this experiment).

[Tree: the ATIS parse of "Show me all the nonstop flights from Dallas to Denver early in the morning"]

Applying EM learning to real language

1. Extract the productions from the trees and estimate their probabilities from the trees to produce a PCFG (sketched below).
2. Initialize EM with this treebank grammar and its MLE probabilities.
3. Apply EM (to the strings alone) to re-estimate the production probabilities.
4. At each iteration:
   - measure the likelihood of the training data and the quality of the parses produced by each grammar
   - test on the training data (so poor performance is not due to overlearning)
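Step 1 might be sketched as follows. The bracketed-tree reader below handles only simple Penn-style bracketings (a real treebank needs more careful preprocessing), and it reuses the productions() helper from the supervised-learning sketch above.

    from collections import Counter

    def parse_bracketed(tokens):
        """Parse one '(LABEL child ...)' expression from a token list."""
        assert tokens.pop(0) == "("
        label = tokens.pop(0)
        children = []
        while tokens[0] != ")":
            children.append(parse_bracketed(tokens) if tokens[0] == "("
                            else tokens.pop(0))
        tokens.pop(0)   # discard the closing ")"
        return (label, *children)

    def read_tree(s):
        return parse_bracketed(s.replace("(", " ( ").replace(")", " ) ").split())

    tree = read_tree("(S (NP (DT the) (NN dog)) (VP (VBZ bites)))")
    rule_counts = Counter(productions(tree))   # productions() as defined above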
[Plot: "Accuracy of parses produced using the learnt grammar" — labelled precision and recall vs. EM iteration (0-20), on a scale from 0.7 to 1]

[Plot: "Probability of training strings" — log P of the training strings vs. EM iteration (0-20), from about -16000 to -14000]
Outline
Introduction
Stochastic grammars
Supervised learning
Unsupervised learning
Learning a real language
Conclusion

Discussion

- Predicting words ≠ finding the correct structure.
- Why didn't the learner find the right structures?
  - the grammar ignores semantics (Zettlemoyer and Collins)
  - predicting words is the wrong objective
  - wrong kind of grammar (Klein and Manning)
  - wrong training data (Yang)
  - wrong learning algorithm (much work in CL and ML)

de Marcken (1995) "Lexical heads, phrase structure and the induction of grammar"
Bayesian learning

- A statistical learning framework that integrates:
  - the likelihood of the data (prediction)
  - bias or prior knowledge (e.g., innate constraints)
- "Hard" priors ignore some analyses and focus on others.
- "Soft" priors bias the learner toward certain hypotheses
  - e.g., markedness constraints (syllables have onsets)
  - can be over-ridden by sufficient data (see the sketch below)
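One simple way to realize a "soft" prior is additive (Dirichlet) smoothing: the prior contributes pseudo-counts that bias the estimate but are swamped by sufficient data. A minimal sketch, with invented rules and numbers:

    def map_estimate(counts, pseudo):
        """P(rule) proportional to observed count + prior pseudo-count."""
        parent_total = {}
        for rule in pseudo:
            parent_total[rule[0]] = (parent_total.get(rule[0], 0.0)
                                     + counts.get(rule, 0.0) + pseudo[rule])
        return {rule: (counts.get(rule, 0.0) + pseudo[rule]) / parent_total[rule[0]]
                for rule in pseudo}

    # The prior prefers NP -> N five-to-one ...
    pseudo = {("NP", ("N",)): 5.0, ("NP", ("NP", "PP")): 1.0}
    print(map_estimate({}, pseudo))        # prior only: 5/6 vs 1/6
    # ... but 100 observations of the adjunction rule override it.
    print(map_estimate({("NP", ("NP", "PP")): 100.0}, pseudo))  # ~0.05 vs ~0.95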
Summary

- Statistical learning extracts more information from the input.
- Curse of dimensionality: something must guide the learner to focus on the correct generalizations.
- There are stochastic versions of most kinds of grammar.
- Statistical grammar learning combines:
  - compositional representations
  - optimization-based learning
- Glass box: grammars use explicit representations of
  - the generalizations learnt
  - the prior knowledge assumed
  ⇒ we can evaluate different kinds of universals
- Predicting the input ≠ correctly analysing the input.
- Applied to psycholinguistics (Jurafsky, Crocker); should be useful for child language.
Grammars in computational linguistics

- 1980s: hand-written linguistic grammars, run on linguistically interesting examples
- early 1990s: simple statistical models dominate speech recognition and computational linguistics
  - they can learn
  - corpus-based evaluation methodology
- late 1990s: techniques for statistical learning of probabilistic grammars
- today: loosely linguistic grammar-based approaches are competitive, but so are non-grammar-based approaches