Grammar induction by Bayesian model averaging
Guy Lebanon
LARG meeting
May 2001
Based on Andreas Stolcke’s thesis, UC Berkeley, 1994
Why automatic grammar induction (AGI)
• Enables using domain-dependent grammars without expert intervention.
• Enables using person-dependent grammars without expert intervention.
• Can be applied to different languages (without a linguist familiar with the particular language).
• A grammar-induction process with expert guidance may be more accurate than a hand-written grammar, since computers are more adept than humans at analyzing large corpora.
Why statistical approaches to AGI
• In practice, languages are not logical structures.
• Sentences as actually spoken are often not precisely grammatical. The solution of expanding the grammar leads to an explosion of grammar rules.
• A large grammar leads to many parses of the same sentence. Clearly, some parses are more accurate than others. Statistical approaches make it possible to include a large set of grammar rules while assigning a probability to each parse.
• Statistics offers known optimality conditions and optimization procedures.
Some Bayesian statistics
• For each grammar $M = (\theta, s)$ (rule probabilities + rules), a prior probability $p(M)$ is assigned. This value may represent experts’ opinion about how likely this grammar is.
• Upon introduction of a training corpus $C$ (an unlabeled corpus), the model posterior is computed by Bayes’ law:
$$p(M \mid C) = \frac{p(M)\,p(C \mid M)}{p(C)} \propto p(M)\,p(C \mid M)$$
• Either the grammar that maximizes the posterior is kept (as
the best grammar), or the set of all grammars and their
posteriors is kept (better).
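A minimal numerical sketch of this Bayes-law step (the function names and the candidate-grammar representation below are illustrative, not taken from Stolcke's work): given a log prior and a log likelihood for each candidate grammar, we can either pick the maximum-posterior grammar or keep the whole normalized posterior for model averaging.

```python
import math

def log_posterior(log_prior, log_likelihood):
    """Unnormalized log posterior: log p(M) + log p(C|M)."""
    return log_prior + log_likelihood

def best_grammar(candidates):
    """candidates: list of (grammar, log_prior, log_likelihood) triples.
    Returns the grammar maximizing the (unnormalized) posterior."""
    return max(candidates, key=lambda c: log_posterior(c[1], c[2]))[0]

def posterior_weights(candidates):
    """Normalized posterior p(M|C) over the candidate set, for model averaging."""
    logs = [log_posterior(lp, ll) for _, lp, ll in candidates]
    m = max(logs)
    unnorm = [math.exp(l - m) for l in logs]   # subtract the max for numerical stability
    z = sum(unnorm)
    return [u / z for u in unnorm]
```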
Priors for CF grammars
• The prior of a grammar, $p(M)$, is split into two parts:
$$p(M) = p(\theta, s) = p(s)\,p(\theta \mid s)$$
• The component $p(s)$ is chosen to introduce a bias toward short grammars (fewer rules). One way of doing that, though still heuristic, is minimum description length (MDL):
$$p(s) \propto e^{-l(s)}$$
• The prior for the rule probabilities is taken to be a uniform Dirichlet prior, which has the effect of smoothing low rule-usage counts.
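A rough sketch of these two prior components, assuming $l(s)$ is a total description length in bits and the symmetric Dirichlet parameter is a free hyper-parameter (both functions below are illustrative, not from the thesis):

```python
import math

def log_mdl_prior(rule_description_lengths):
    """log p(s) up to an additive constant, with p(s) proportional to exp(-l(s)):
    l(s) is the summed description length of the rules, giving a bias toward
    grammars with fewer / shorter rules."""
    return -sum(rule_description_lengths)

def log_uniform_dirichlet(thetas, alpha=1.0):
    """Log density of a symmetric Dirichlet(alpha, ..., alpha) prior on the
    production probabilities of one non-terminal; the prior smooths low
    rule-usage counts."""
    k = len(thetas)
    log_norm = math.lgamma(k * alpha) - k * math.lgamma(alpha)
    return log_norm + sum((alpha - 1.0) * math.log(t) for t in thetas)
```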
Grammar posterior
• It is too hard to maximize the posterior over both the rules and the probabilities. Instead, the search is done to maximize the posterior of the rules only:
$$
p(s \mid C) = \frac{p(s)\,p(C \mid s)}{p(C)}
= \frac{p(s)}{p(C)} \int p(C, \theta \mid s)\, d\theta
= \frac{p(s)}{p(C)} \int p(\theta \mid s) \prod_{x \in C} p(x \mid \theta, s)\, d\theta
$$
Here $p(\theta \mid s)$ is the Dirichlet prior and $p(x \mid \theta, s)$ is a sum over all derivations of $x$. Replacing that sum by the probability of a single derivation $V(x)$ gives the approximation
$$
p(s \mid C) \approx \frac{p(s)}{p(C)} \int p(\theta \mid s) \prod_{x \in C} p(V(x) \mid \theta, s)\, d\theta
$$
• where $V(x)$ is the Viterbi derivation of $x$. The last integral has a closed-form solution.
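The closed form is the standard Dirichlet integral of the Viterbi rule-usage counts (a ratio of Beta functions). The sketch below, assuming a symmetric Dirichlet with parameter `alpha` and a dictionary mapping each non-terminal to the usage counts of its productions, shows one way it could be computed; the names are illustrative.

```python
import math

def log_dirichlet_multinomial(counts, alpha=1.0):
    """Closed form of  integral Dir(theta; alpha) * prod_i theta_i**c_i dtheta
    for the productions of one non-terminal (Dirichlet-multinomial)."""
    k = len(counts)
    n = sum(counts)
    log_num = sum(math.lgamma(alpha + c) for c in counts) - math.lgamma(k * alpha + n)
    log_den = k * math.lgamma(alpha) - math.lgamma(k * alpha)
    return log_num - log_den

def log_viterbi_likelihood(rule_counts_per_nonterminal, alpha=1.0):
    """Approximate log integral p(theta|s) p(C|theta,s) dtheta using
    rule counts collected from the Viterbi derivations of the corpus."""
    return sum(log_dirichlet_multinomial(counts, alpha)
               for counts in rule_counts_per_nonterminal.values())
```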
Maximizing the posterior
• Even though an approximation to the posterior can be computed in closed form, finding a grammar that maximizes it is still a hard problem.
• A. Stolcke: start with many rules, then apply greedy rule-merging operations to maximize the posterior.
• Model merging was applied to hidden Markov models, probabilistic context-free grammars, and probabilistic attribute grammars (PCFGs with semantic features tied to non-terminals).
A concrete example: PCFG
• A specific PCFG consists of a list of rules $s$ and a set of production probabilities $\theta$.
• For a given $s$, the production probabilities can be learned with EM. Coming up with an optimal $s$ is still an open problem; Stolcke’s model merging is an attempt to tackle it.
• Given a corpus (a set of sentences), an initial set of rules is constructed: for each sentence $w_1 w_2 \ldots w_l$ in the corpus, add the rules
$$S \to X_1 X_2 \ldots X_l, \qquad X_1 \to w_1, \ldots, X_l \to w_l$$
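A small sketch of this initial-grammar construction (the non-terminal naming scheme and the `(lhs, rhs)` rule representation below are made up for illustration; Stolcke's actual implementation may differ):

```python
def initial_rules(corpus):
    """Build the initial rule set: one top-level rule per sentence plus one
    lexical non-terminal per word occurrence."""
    rules = []          # list of (lhs, rhs) pairs; rhs is a tuple of symbols
    counter = 0
    for sentence in corpus:             # sentence = list of words w_1 ... w_l
        nts = []
        for word in sentence:
            nt = f"X{counter}"
            counter += 1
            rules.append((nt, (word,)))     # X_i -> w_i   (lexical rule)
            nts.append(nt)
        rules.append(("S", tuple(nts)))     # S -> X_1 ... X_l
    return rules

# Example: initial_rules([["the", "dog", "barks"]])
```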
Merging operators
• Non-terminal merging: replace two existing non-terminals with a single new non-terminal.
$$Z_1 \to \lambda_1 X_1 \mu_1, \qquad Z_2 \to \lambda_2 X_2 \mu_2$$
$$X_1 \to \rho_1, \qquad X_2 \to \rho_2$$
$$\mathrm{merge}(X_1, X_2) = Y$$
$$Z_1 \to \lambda_1 Y \mu_1, \qquad Z_2 \to \lambda_2 Y \mu_2$$
$$Y \to \rho_1, \qquad Y \to \rho_2$$
• Non-terminal chunking: given an ordered sequence of non-terminals $X_1 X_2 \ldots X_k$, create a new non-terminal $Y$ that expands to $X_1 X_2 \ldots X_k$ and replace occurrences of $X_1 X_2 \ldots X_k$ in right-hand sides with $Y$.
$$Z \to \lambda X_1 X_2 \ldots X_k \mu$$
$$\mathrm{chunk}(X_1, \ldots, X_k) = Y$$
$$Z \to \lambda Y \mu, \qquad Y \to X_1 X_2 \ldots X_k$$
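A sketch of the two operators on a flat list of `(lhs, rhs)` rules, using the same hypothetical representation as the initial-grammar sketch above:

```python
def merge_nonterminals(rules, x1, x2, new_nt):
    """merge(X1, X2) = Y: replace every occurrence of X1 or X2, on either
    side of a rule, by the new non-terminal Y."""
    def sub(sym):
        return new_nt if sym in (x1, x2) else sym
    merged = [(sub(lhs), tuple(sub(s) for s in rhs)) for lhs, rhs in rules]
    return list(dict.fromkeys(merged))   # drop rules that became duplicates

def chunk_nonterminals(rules, seq, new_nt):
    """chunk(X1,...,Xk) = Y: add Y -> X1...Xk and replace the sequence
    X1...Xk inside every right-hand side by Y."""
    seq = tuple(seq)
    def replace(rhs):
        out, i = [], 0
        while i < len(rhs):
            if rhs[i:i + len(seq)] == seq:
                out.append(new_nt)
                i += len(seq)
            else:
                out.append(rhs[i])
                i += 1
        return tuple(out)
    new_rules = [(lhs, replace(rhs)) for lhs, rhs in rules]
    new_rules.append((new_nt, seq))
    return new_rules
```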
PCFG priors
Prior for rule probabilities (Dirichlet):
$$p(\theta \mid s) = \prod_i p(\theta_i \mid s) \propto \prod_i \theta_i^{\alpha_i - 1}$$
Prior for rules:
For a non-lexical rule (one that does not produce a terminal symbol) the description length is
$$\mathrm{description\_length}(X \to Y_1 Y_2 \ldots Y_k) = (k + 1)\log(|N| + 1)$$
For a lexical rule (one that produces a terminal symbol) the description length is
$$\mathrm{description\_length}(X \to a) = \log|\Sigma|$$
The prior was taken to be either exponentially decreasing or Poisson in the description length:
$$p(s) \propto e^{-l(s)} = \prod_{r \in s} e^{-l(r)}$$
$$p(s) = \frac{e^{-\lambda}\,\lambda^{l(s)}}{l(s)!}$$
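Putting the description lengths and the two structure priors together in code (a sketch only: the base-2 logarithms, the set arguments, and the Poisson rate below are assumptions, not values from the thesis):

```python
import math

def description_length(rule, nonterminals, terminals):
    """Description length of a single rule under the scheme above:
    non-lexical rules pay log(|N|+1) bits per symbol (LHS plus RHS),
    lexical rules X -> a pay log|Sigma| bits."""
    lhs, rhs = rule
    if len(rhs) == 1 and rhs[0] in terminals:       # lexical rule
        return math.log(len(terminals), 2)
    return (len(rhs) + 1) * math.log(len(nonterminals) + 1, 2)

def log_prior_exponential(rules, nonterminals, terminals):
    """log p(s) up to a constant for the exponentially decreasing prior."""
    return -sum(description_length(r, nonterminals, terminals) for r in rules)

def log_prior_poisson(rules, nonterminals, terminals, lam=3.0):
    """Log of a Poisson(lam) prior on the total description length
    (lam is an illustrative hyper-parameter)."""
    l = sum(description_length(r, nonterminals, terminals) for r in rules)
    # lgamma(l + 1) stands in for log l(s)! and handles non-integer lengths
    return -lam + l * math.log(lam) - math.lgamma(l + 1)
```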
Search strategy
• Start with the initial rules.
• Try applying all possible merge operations. For each resulting grammar compute the posterior and choose the merge which resulted in the highest posterior.
• Search strategies: best-first search, best-first with look-ahead, beam search.
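The best-first variant of this loop might look roughly as follows; the three function arguments stand in for the scoring and merge pieces sketched earlier, and look-ahead or beam search would keep more than one candidate per step.

```python
def greedy_merge_search(rules, corpus, log_posterior_fn,
                        candidate_merges_fn, apply_merge_fn):
    """Best-first model merging: repeatedly try every candidate merge, keep
    the one that most improves the (approximate) log posterior, and stop
    when no merge improves it."""
    best = log_posterior_fn(rules, corpus)
    while True:
        scored = [(log_posterior_fn(apply_merge_fn(rules, m), corpus), m)
                  for m in candidate_merges_fn(rules)]
        if not scored:
            break
        top_score, top_merge = max(scored, key=lambda t: t[0])
        if top_score <= best:          # no merge improves the posterior
            break
        rules, best = apply_merge_fn(rules, top_merge), top_score
    return rules
```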
Now some examples…