DKRLM: Discriminant Knowledge-Rich Language Modeling for Machine Translation
Alon Lavie
"Visionary Talk", LTI Faculty Retreat, May 4, 2007

Background: Search-based MT
• All state-of-the-art MT approaches work within a general search-based paradigm
  – Translation models "propose" pieces of translation for various sub-sentential segments
  – A decoder puts these pieces together into complete translation hypotheses and searches for the best-scoring hypothesis
• (Target) language modeling is the most dominant source of information in scoring alternative translation hypotheses

The Problem
• Most MT systems use standard statistical LMs that come from speech recognition (SR), usually "as is"
  – SRI-LM toolkit, CMU/CU LM, SALM toolkit
  – Until recently, usually trigram models
• The problem: these LMs are not good at discriminating between good and bad translations!
• How do we know?
  – Oracle experiments on n-best lists of MT output consistently show that far better translations are "hiding" in the n-best lists but are not being selected by our MT systems
  – Also true of our MEMT system… which led me to start thinking about this problem!

The Problem (cont.)
• Why do standard statistical LMs not work well for MT?
  – MT hypotheses are very different from SR hypotheses
    • Speech: mostly correct word order, confusable homonyms
    • MT: garbled syntax and word order, wrong choices for some translated words
  – MT violates some basic underlying assumptions of statistical LMs:
    • Indirect discrimination: better translations should have better LM scores, but LMs are not trained to directly discriminate between good and bad translations!
    • Fundamental probability-estimation problems: backoff "smoothing" for unseen n-grams is based on an assumption of training-data sparsity, but the majority of n-grams in MT hypotheses have not been seen because they are not grammatical (they really should have a zero probability!)
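The backoff point above can be made concrete with a toy example. The sketch below is not any of the toolkits named in the talk; it is a minimal count-based trigram scorer with a fixed-penalty ("stupid") backoff, showing that an ungrammatical word sequence still receives a nonzero score as long as its individual words are known:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class BackoffLM:
    """Toy trigram scorer with fixed-penalty backoff: unseen trigrams
    fall back to bigram counts, then unigram counts, with penalty alpha."""
    def __init__(self, sentences, alpha=0.4):
        self.alpha = alpha
        self.counts = {n: Counter(ng for s in sentences
                                  for ng in ngrams(s.split(), n))
                       for n in (1, 2, 3)}
        self.total = sum(self.counts[1].values())

    def score(self, w1, w2, w3):
        if self.counts[3][(w1, w2, w3)] > 0:
            return self.counts[3][(w1, w2, w3)] / self.counts[2][(w1, w2)]
        if self.counts[2][(w2, w3)] > 0:
            return self.alpha * self.counts[2][(w2, w3)] / self.counts[1][(w2,)]
        return self.alpha ** 2 * self.counts[1][(w3,)] / self.total

lm = BackoffLM(["the boy ate the red apple", "the girl ate the apple"])
# Word salad never seen as a trigram or bigram still scores above zero:
print(lm.score("apple", "the", "ate") > 0)   # backs off all the way to unigrams
```

This is exactly the mismatch the slide describes: the model cannot assign the zero probability that an ungrammatical sequence arguably deserves.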
The New Idea
• Rather than attempting to model the probabilities of unseen n-grams, we look at the problem differently:
  – Extract instances of lexical, syntactic, and semantic features from each translation hypothesis
  – Determine whether these instances have been "seen before" (at least once) in a large monolingual corpus
• The conjecture: more grammatical MT hypotheses are likely to contain higher proportions of feature instances that have been seen in a corpus of grammatical sentences.
• Goals:
  – Find the set of features that provides the best discrimination between good and bad translations
  – Learn how to combine these into an LM-like function for scoring alternative MT hypotheses

Outline
• Knowledge-Rich Features
• Preliminary Experiments:
  – Compare feature occurrence statistics for MT hypotheses versus human-produced (reference) translations
  – Compare rankings of MT and "human" systems according to statistical LMs versus a function based on long n-gram occurrence statistics
  – Compare n-grams and n-chains as features for binary "human versus MT" classification
• Research Challenges
• New Connections with IR

Knowledge-Rich Features
• Lexical features:
  – "Long" n-gram sequences (4 words and up)
• Syntactic/semantic features:
  – POS n-grams
  – Head-word chains
  – Specific types of dependencies:
    • Verbs and their dependents
    • Nouns and their dependents
    • "Long-range" dependencies
  – Content-word co-occurrence statistics
• Mixtures of lexical and syntactic features:
  – Abstracted versions of word n-gram sequences, where words are replaced by POS tags or named-entity tags

Head-Word Chains (n-chains)
  "The boy ate the red apple"
• Head-word chains are chains of syntactic dependency links (from dependents to their heads)
• Bi-chains: [the → boy] [boy → ate] [the → apple] [red → apple] [apple → ate]
• Tri-chains: [the → boy → ate] [the → apple → ate] [red → apple → ate]
• Four-chains: none (for this example)!
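The chain extraction above is simple to sketch. The function below is an illustration, not the talk's actual tooling: given words and their head indices (a toy dependency parse, with None marking the root), it follows head links to enumerate n-chains, reproducing the bi-/tri-/four-chain counts for the example sentence:

```python
def head_word_chains(words, heads, n):
    """Extract n-chains: start at each word and follow dependency links
    up through its chain of heads; keep only chains of exactly n words."""
    chains = []
    for i in range(len(words)):
        chain, j = [words[i]], i
        for _ in range(n - 1):
            j = heads[j]
            if j is None:          # reached the root; chain cannot grow
                break
            chain.append(words[j])
        if len(chain) == n:
            chains.append(tuple(chain))
    return chains

# "The boy ate the red apple": "ate" is the root; "boy" and "apple"
# attach to "ate"; determiners and "red" attach to their nouns.
words = ["the", "boy", "ate", "the", "red", "apple"]
heads = [1, 2, None, 5, 5, 2]   # heads[i] = index of word i's head

print(head_word_chains(words, heads, 2))  # five bi-chains, as on the slide
print(head_word_chains(words, heads, 4))  # -> [] : no four-chains
```

With these heads, the bi-chains and tri-chains come out exactly as listed on the slide.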
Specific Types of Dependencies
• Some types of syntactic dependencies may be more important than others for MT
• Consider the specific types of dependencies that are most important for syntactic and semantic structure:
  – Dependencies involving content words
  – Long-distance dependencies
  – Verb/argument dependencies: focus only on the bi-chains where the head is the verb: [boy → ate] and [apple → ate]
  – Noun/modifier dependencies: focus only on the bi-chains where the noun is the head: [the → boy] [the → apple] [red → apple]

Feature Occurrence Statistics for MT Hypotheses
• The general idea: determine the fraction of feature instances that have been observed to occur in a large human-produced corpus
• For n-grams:
  – Extract all n-gram sequences of order n from the hypothesis
  – Look up whether each n-gram instance occurs in the corpus
  – Calculate the fraction of "found" n-grams for each order n
• For n-chains:
  – Parse the MT hypothesis (into a dependency structure)
  – Look up whether each n-chain instance occurs in a database of n-chains extracted from the large corpus
  – Calculate the fraction of "found" n-chains for each order n

Content-Word Co-occurrence Statistics
• Content-word co-occurrences: (unordered) pairs of content words (nouns, verbs, adjectives, adverbs) that co-occur in the same sentence
• Restricted version: the subset of co-occurrences that are in a direct syntactic dependency within the sentence (a subset of bi-chains)
• Idea:
  – Learn co-occurrence pair strengths from large monolingual corpora using statistical measures: Dice, t-score, chi-square, likelihood ratio
  – Use average co-occurrence pair strength as a feature for scoring MT hypotheses
  – A weak way of capturing the syntax/semantics within sentences
• Preliminary experiments show that these features are somewhat effective in discriminating between MT output and human references
• Thanks, Ben Han!
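Of the association measures listed, Dice is the simplest to illustrate. The following is a minimal sketch under simplifying assumptions (sentence-level presence counts, a hand-picked content-word set); the function and variable names are illustrative, not from the original work:

```python
from itertools import combinations

def dice_scores(sentences, content_words):
    """Dice(a, b) = 2 * count(a and b in same sentence) / (count(a) + count(b)),
    computed over unordered content-word pairs in a monolingual corpus."""
    word_count, pair_count = {}, {}
    for sent in sentences:
        cw = sorted({w for w in sent.split() if w in content_words})
        for w in cw:
            word_count[w] = word_count.get(w, 0) + 1
        for pair in combinations(cw, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    return {p: 2 * c / (word_count[p[0]] + word_count[p[1]])
            for p, c in pair_count.items()}

def avg_cooccurrence_strength(hypothesis, scores, content_words):
    """Average pair strength over content-word pairs in one MT hypothesis;
    unseen pairs contribute zero."""
    cw = sorted({w for w in hypothesis.split() if w in content_words})
    pairs = list(combinations(cw, 2))
    return sum(scores.get(p, 0.0) for p in pairs) / len(pairs) if pairs else 0.0

corpus = ["the boy ate the red apple", "the boy saw the red ball"]
content = {"boy", "ate", "red", "apple", "saw", "ball"}
scores = dice_scores(corpus, content)
print(scores[("apple", "boy")])   # co-occur once; Dice = 2*1/(1+2)
```

A hypothesis whose content words form familiar pairs (high average strength) is then preferred over one pairing words that rarely appear together.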
[MT Lab Project, 2005]

Preliminary Experiments I
• Goal: compare n-gram occurrence statistics for MT hypotheses versus human-produced (reference) translations
• Setup:
  – Data: NIST Arabic-to-English MT-Eval 2003 (about 1,000 sentences)
  – Output from three strong MT systems and four reference translations
  – Used the Suffix-Array LM (SALM) toolkit [Zhang and Vogel 2006], modified to return, for each string call, the length of the longest suffix of the string that occurs in the corpus
  – SALM used to index a subset of 600 million words from the Gigaword corpus
  – Searched for all n-gram sequences of length up to eight extracted from the translation
• Thanks to Greg Hanneman!

Fraction of n-grams found in the corpus, by order:

             MT Translations   Reference Translations   Ref/MT Ratio   Margin
  8-grams         2.1%                 2.9%                 1.38        +38%
  7-grams         4.9%                 6.4%                 1.31        +31%
  6-grams        11.4%                14.1%                 1.24        +24%
  5-grams        25.2%                29.1%                 1.15        +15%
  4-grams        48.4%                52.2%                 1.08         +8%
  3-grams        75.9%                77.7%                 1.02         +2%
  2-grams        94.8%                94.4%                 0.995       -0.5%
  1-grams        99.3%                98.2%                 0.989       -1.1%

Preliminary Experiments II
• Goal: compare the rankings of MT and "human" systems according to statistical LMs versus a function based on long n-gram occurrence statistics
• Same data setup as in the first experiment
• Calculate sentence scores as the average per-word LM score
• System score is the average over all of its sentence scores
• Score each system with three different LMs:
  – SRI-LM trigram LM trained on 260 million words
  – SALM suffix-array LM trained on 600 million words
  – A new function that assigns exponentially more weight to longer n-gram "hits":

      score = (1/n) · Σ_{i=1..n} 3^(ord(i) − 8)

    where n is the number of word positions in the hypothesis and ord(i) is the order of the longest n-gram "hit" at position i (maximum 8)

System scores (rank in parentheses):

  System         SRI-LM trigram LM   SALM 8-gram LM   Occurrence-based exp. score
  Ref ahe           -2.23 (1)           -5.59 (1)           0.01059 (1)
  Ref ahi           -2.28 (4)           -5.87 (4)           0.00957 (2)
  Ref ahd           -2.31 (5)           -5.99 (5)           0.00926 (3)
  Ref ahg           -2.33 (6)           -6.04 (7)           0.00914 (4)
  MT system 1       -2.27 (3)           -5.77 (3)           0.00895 (5)
  MT system 2       -2.24 (2)           -5.75 (2)           0.00855 (6)
  MT system 3       -2.39 (7)           -6.01 (6)           0.00719 (7)
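The occurrence-based score is easy to prototype. The sketch below assumes one reading of ord(i) — the length, capped at 8, of the longest corpus-attested n-gram ending at position i — and uses a plain hash set of corpus n-grams in place of a suffix array; names and the set-based lookup are illustrative choices, not the talk's implementation:

```python
def longest_hit(tokens, i, corpus_ngrams, max_n=8):
    """Length of the longest n-gram ending at position i that occurs in
    the reference corpus; 0 if even the unigram is unseen."""
    best = 0
    for n in range(1, max_n + 1):
        if i - n + 1 < 0:
            break
        if tuple(tokens[i - n + 1: i + 1]) in corpus_ngrams:
            best = n
    return best

def occurrence_score(hypothesis, corpus_ngrams, max_n=8):
    """score = (1/n) * sum_i 3^(ord(i) - max_n): exponentially more credit
    for longer n-gram 'hits' against the monolingual corpus."""
    tokens = hypothesis.split()
    return sum(3 ** (longest_hit(tokens, i, corpus_ngrams, max_n) - max_n)
               for i in range(len(tokens))) / len(tokens)

def all_ngrams(sentences, max_n=8):
    """Index every n-gram (order 1..max_n) of a toy corpus as a set."""
    ngrams = set()
    for s in sentences:
        toks = s.split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                ngrams.add(tuple(toks[i:i + n]))
    return ngrams

corpus = all_ngrams(["the boy ate the red apple"])
# A fully attested sentence outscores a scrambled version of the same words:
print(occurrence_score("boy the apple red the ate", corpus) <
      occurrence_score("the boy ate the red apple", corpus))
```

Because each position contributes 3^(ord(i) − 8), a single 8-gram hit is worth 3^6 times a unigram-only hit, which is what pushes the references above the MT systems in the table.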
Preliminary Experiments III
• Goal: directly discriminate between MT and human translations using a binary SVM classifier trained on n-gram versus n-chain occurrence statistics
• Setup:
  – Data: NIST Chinese-to-English MT-Eval 2003 (919 sentences)
  – Four MT system outputs and four human reference translations
  – N-chain database created using SALM by extracting all n-chains from a dependency-parsed version of the English Europarl corpus (600K sentences)
  – Train an SVM classifier on 400 sentences from two MT systems and two human "systems"
  – Test classification accuracy on 200 unseen test sentences from the same MT and human systems
  – Features for the SVM: n-gram "hit" fractions (all n) versus n-chain fractions
• Thanks to Vamshi Ambati!

• Results:
  – Experiment 1:
    • N-gram classifier: 49% accuracy
    • N-chain classifier: 69% accuracy
  – Experiment 2:
    • N-gram classifier: 52% accuracy
    • N-chain classifier: 63% accuracy
• Observations:
  – Mixing both n-grams and n-chains did not improve classification accuracy
  – Features include both high- and low-order instances (did not try with only high-order ones)
  – The n-chain database is from a different domain than the test data, and not a very large corpus

Preliminary Conclusions
• Statistical LMs do not discriminate well between MT hypotheses and human reference translations; they are also poor at discriminating between good and bad MT hypotheses
• Long n-gram and n-chain occurrence statistics differ significantly between MT hypotheses and human reference translations
• These statistics can potentially be useful as discriminant features for identifying better (more grammatical and fluent) translations

Research Challenges
• Develop infrastructure for computing with knowledge-rich features:
  – Scale up to querying against much larger monolingual corpora (terabytes and up)
  – Parsing and annotation of such vast corpora
• Explore more complex features
• Find the set of features
  that are most discriminant
• Develop methodologies for training LM-like discriminant scoring functions:
  – SVM and/or other classifiers on MT versus human
  – SVM and/or other classifiers on MT versus MT "oracle"
  – Direct regression against human judgments
  – Parameter optimization for maximizing automatic MT metric scores (BLEU, METEOR, etc.)
• "Incremental" features that can be used during decoding versus the full set of features for n-best list reranking

New Connections with IR
• The "occurrence-based" formulation of the LM problem transforms it from a counting-and-estimation problem into an IR-like querying problem:
  – To be effective, we think this may require querying against extremely large volumes of monolingual text, and structured versions of such text. Can we do this against local snapshots of the entire web?
  – The SALM suffix-array infrastructure can currently handle up to about the size of the Gigaword corpus (within 16 GB of memory)
  – Can IR engines such as LEMUR/Indri be adapted to the task?

New Connections with IR (cont.)
• Challenges this type of task imposes on IR (insights from Jamie Callan):
  – The larger issue: IR search engines as query interfaces to vast collections of structured text:
    • Building an index suitable for very fast "n-gram" lookups that satisfy certain properties
    • The n-gram sequences might be a mix of surface features and derived features based on text annotations, e.g., $PersonName, or POS=N
  – Specific challenges:
    • How to build such indexes for fast access?
    • What does the query language look like?
    • How to deal with memory/disk versus speed tradeoffs?
    • Can we get LTI students to do this kind of research?
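To make the indexing challenge concrete, here is a naive token-level suffix array supporting exact n-gram membership queries. This is only a sketch of the idea behind SALM-style indexes (sort all suffixes once, then binary-search); a real implementation would use compact integer arrays and careful memory layout:

```python
class TokenSuffixArray:
    """Naive suffix array over a tokenized corpus: sort all suffixes once,
    then answer 'does this n-gram occur anywhere?' by binary search."""
    def __init__(self, tokens):
        self.tokens = tokens
        # O(n^2 log n) toy construction; real suffix arrays build in O(n log n)
        self.sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def contains(self, ngram):
        ngram = list(ngram)
        lo, hi = 0, len(self.sa)
        while lo < hi:  # lower bound: first suffix whose prefix >= ngram
            mid = (lo + hi) // 2
            prefix = self.tokens[self.sa[mid]: self.sa[mid] + len(ngram)]
            if prefix < ngram:
                lo = mid + 1
            else:
                hi = mid
        if lo == len(self.sa):
            return False
        start = self.sa[lo]
        return self.tokens[start: start + len(ngram)] == ngram

corpus = "the boy ate the red apple".split()
sa = TokenSuffixArray(corpus)
print(sa.contains(["the", "red", "apple"]))   # True
print(sa.contains(["red", "boy"]))            # False
```

One sorted index answers membership for every order n at once, which is exactly why suffix arrays suit the "long n-gram hit" queries discussed above; extending the lookup to annotated tokens (e.g., POS=N) is the open IR question the slide raises.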
Final Words…
• A novel and exciting new research direction; there are at least one or two PhD theses hiding in here…
• Submitted as a grant proposal to NSF last December (jointly with Rebecca Hwa from Pitt)
• Influences: some of these ideas were influenced by Jaime's CBMT work, and by Rebecca's work on using syntactic features for automatic MT evaluation metrics
• Acknowledgments:
  – Thanks to Joy Zhang and Stephan Vogel for making the SALM toolkit available to us
  – Thanks to Rebecca Hwa and to my students Ben Han, Greg Hanneman, and Vamshi Ambati for preliminary work on these ideas