CA446 Statistical Machine Translation
Week 2: Introduction to SMT
Lecturer: Qun Liu
Lab Tutors: Xiaofeng Wu, Iacer Calixto
2nd Semester, 2014-2015 Academic Year
23 February 2015
http://computing.dcu.ie/~qliu/CA446

Content
• Linguistics Basics
• Probability Theory Basics
• What is SMT?
• Noisy Channel Framework for SMT
• SMT Flow

Words
• Words are a basic unit of meaning.
• The process of breaking a sentence into words is known as tokenisation.
• Words can be categorised into different parts of speech: noun, verb, adjective, adverb, determiner, pronoun, preposition, conjunction, possessive marker, interjection, punctuation, etc.
• We make a distinction between content words and function words.

Morphemes
• Words can be decomposed into morphemes.
• We make a distinction between inflectional morphology and derivational morphology.
• Inflectional morphemes are used to mark grammatical information such as gender, number, person, tense and case.
• Derivational morphemes are used to change the part of speech of a word, e.g. the suffix -al changes the noun sensation into the adjective sensational.

Syntax
• Syntax is the study of sentence structure and the relationships between words.
• We can represent the structure of a sentence using various types of data structures, including graphs and trees.
• Two very common representations are:
  – Phrase Structure Trees
  – Dependency Trees

Phrase Structure versus Dependency
[Figure: a phrase structure tree and a dependency tree of the same sentence, showing subj and obj relations.]

Discourse
• Discourse is concerned with written and spoken communication: the study of language above the sentence level.
• A major challenge for automatic discourse analysis is co-reference (anaphora) resolution:
  – The snow fell yesterday. This was the first time it had come in November.
  – In their free time, the children played Candy Crush.

Content
• Linguistics Basics
• Probability Theory Basics
• What is SMT?
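To make the tokenisation step from the Words slide concrete, here is a minimal sketch in Python. It is only an illustration, not any particular system's tokeniser: it keeps runs of word characters together and splits punctuation off, while real tokenisers also handle clitics, abbreviations, numbers and much more.

```python
import re

def tokenise(sentence):
    # Keep runs of word characters as tokens; split each punctuation
    # mark off into a token of its own.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenise("The snow fell yesterday."))
# ['The', 'snow', 'fell', 'yesterday', '.']
```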
• Noisy Channel Framework for SMT
• SMT Flow

Random Variables
• A random variable is a quantity whose value is not fixed and which can take on different values.
• If X is a random variable describing the outcome of rolling a dice, the possible values for X are 1, 2, 3, 4, 5 and 6.

Probability Distributions
• A probability distribution is a function which maps each value of a random variable to a probability.
• If X is a random variable describing the outcome of rolling a fair dice:
  p(X=1) = 1/6   p(X=2) = 1/6   p(X=3) = 1/6
  p(X=4) = 1/6   p(X=5) = 1/6   p(X=6) = 1/6
• Each probability is in the range 0...1:
  ∀x : 0 ≤ p(X=x) ≤ 1
• The sum of the probabilities of all possible values of the variable is 1:
  Σx p(X=x) = 1
• Two very common probability distributions are:
  – the uniform distribution
  – the normal distribution
• Probability distributions can be estimated from data samples; the larger the sample, the better.

Normal Distribution
[Figure: the bell-shaped density curve of the normal distribution.]

Probability Distributions
• Say we suspect that the dice is loaded...
• Throw the dice k times and estimate the probability of each value:
  p(X=1) = #1/k   p(X=2) = #2/k   p(X=3) = #3/k
  p(X=4) = #4/k   p(X=5) = #5/k   p(X=6) = #6/k
  where #n is the number of times n is thrown.
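The loaded-dice estimation above is easy to sketch in code. A minimal illustration (the sequence of throws is invented data):

```python
from collections import Counter

def estimate_distribution(throws):
    # p(X=n) ≈ #n / k, where #n is how often n appears among the k throws.
    k = len(throws)
    counts = Counter(throws)  # missing values get count 0
    return {n: counts[n] / k for n in range(1, 7)}

throws = [1, 1, 6, 6, 6, 6, 2, 3, 6, 6]  # a suspiciously loaded dice
p = estimate_distribution(throws)
print(p[6])  # 0.6 -- far from the fair 1/6
```

The estimated probabilities still sum to one (up to floating-point rounding), as any probability distribution must.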
Notation
• We can omit the name of the random variable and write p(x) instead of p(X=x).

Joint Probability Distributions
• Given two random variables X and Y, the probability of X taking on the value x and Y taking on the value y is the joint probability, written as p(x,y).
• If X and Y are independent, then p(x,y) = p(x)p(y).

Conditional Probability Distributions
• Given two random variables X and Y, the probability of Y taking on the value y given that X has the value x is the conditional probability, written as:
  p(y|x) = p(x,y) / p(x)

Bayes' Rule
• From the definition of conditional probability, we can derive Bayes' Rule:
  p(x|y) = p(y|x) p(x) / p(y)

Content
• Linguistics Basics
• Probability Theory Basics
• What is SMT?
• Noisy Channel Framework for SMT
• SMT Flow

Different Approaches to MT
Different approaches can be classified based on whether they are rule-based or data-driven:
• In a rule-based system, rules for translating from one language to another are handcrafted by linguists and knowledge engineers.
• In a data-driven system, rules are learnt automatically from large quantities of data.

What is SMT?
SMT systems are data-driven and:
• automatically learn how to translate text from large corpora of example translations (mostly produced by humans);
• learn how to translate during the training phase;
• assign likelihoods to translation alternatives.
The corpus of translations is known as a bitext or parallel corpus.
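Bayes' Rule from the probability slides above can be checked numerically. A small sketch with a made-up joint distribution over two variables (the values 0.3, 0.2, 0.1, 0.4 are invented for illustration):

```python
# A made-up joint distribution p(x, y) over two random variables.
p_xy = {("a", "c"): 0.3, ("a", "d"): 0.2,
        ("b", "c"): 0.1, ("b", "d"): 0.4}

def p_x(x):  # marginal p(x) = sum over y of p(x, y)
    return sum(p for (xi, _), p in p_xy.items() if xi == x)

def p_y(y):  # marginal p(y) = sum over x of p(x, y)
    return sum(p for (_, yi), p in p_xy.items() if yi == y)

def p_y_given_x(y, x):  # conditional p(y|x) = p(x, y) / p(x)
    return p_xy[(x, y)] / p_x(x)

# Bayes' Rule: p(x|y) = p(y|x) p(x) / p(y)
lhs = p_xy[("a", "c")] / p_y("c")                 # p(x|y) from the definition
rhs = p_y_given_x("c", "a") * p_x("a") / p_y("c") # p(x|y) via Bayes' Rule
print(abs(lhs - rhs) < 1e-12)  # True
```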
Rosetta Stone
[Figure: the Rosetta Stone, an early example of a parallel text.]

Parallel Corpus
[Figures: examples of sentence-aligned parallel corpora.]

SMT: The Basic Idea

Aligning Words: An Example
[Figures: a word-alignment puzzle between a source sentence and its translation, solved step by step over several slides.]

SMT: The Basic Idea
• The idea of SMT is to simulate the above process of solving the word puzzle.
• First we need to define a probabilistic model of the translation process.
• Then we use this probabilistic model to find the most likely target translation for a given source sentence.

Probabilistic Model for Translation
• Given any English sentence e and any foreign sentence f, we define the probability that e is a translation of f:
  p(e|f)
  where the normalisation condition is:
  Σe p(e|f) = 1

Translation as Searching
• Once we have a probabilistic translation model, the translation problem can be regarded as the problem of searching for the English sentence ê with the highest probability given the input foreign sentence f:
  ê = argmax_e p(e|f)

Content
• Linguistics Basics
• Probability Theory Basics
• What is SMT?
• Noisy Channel Framework for SMT
• SMT Flow

Noisy Channel Framework
• The Noisy Channel Framework (Model) was proposed by IBM researchers Peter F. Brown et al. in 1990:
  Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, Paul S.
Roossin, A Statistical Approach to Machine Translation, Computational Linguistics, 1990.

Noisy Channel Framework
[Figure: the channel model. An English sentence e is generated with probability p(e), passes through a noisy channel modelled by p(f|e), and emerges as a French sentence f.]
  p(f) = Σe p(e) p(f|e)

Noisy Channel Framework
• Applying Bayes' Rule, we have:
  p(e|f) = p(e) p(f|e) / p(f)
• Thus:
  ê = argmax_e p(e|f) = argmax_e p(e) p(f|e)

SMT Components
  ê = argmax_e p(e) p(f|e)
• p(f|e): Translation Model
• p(e): Language Model
• argmax search: Decoder

Noisy Channel Framework
• The translation model models how likely it is that f is a translation of e – adequacy.
• The language model models how likely it is that e is an acceptable sentence – fluency.
• The decoder searches for the most likely e.

Fluency versus Adequacy
Source sentence: Le chat entre dans la chambre.
• Adequate, fluent translation: The cat enters the bedroom.
• Adequate, disfluent translation: The cat enters in the bedroom.
• Fluent, inadequate translation: My Granny plays the piano.
• Disfluent, inadequate translation: piano Granny the plays My

Content
• Linguistics Basics
• Probability Theory Basics
• What is SMT?
• Noisy Channel Framework for SMT
• SMT Flow

Model Training
How can we obtain a translation model and a language model?
• A probabilistic model can be trained using data samples.
• For the translation model, the data samples are a bilingual (parallel) text corpus.
• For the language model, the data samples are a monolingual text corpus.

SMT Flow
[Figure: the SMT pipeline. Parallel text data is fed to translation model training, producing the translation model; monolingual text data in the target language is fed to language model training, producing the language model; the decoder then uses both models to translate source text into target text.]

Discussion

Acknowledgement
Parts of the content of this lecture are taken from previous lectures and presentations given by Jennifer Foster, Declan Groves, Yvette Graham, Kevin Knight, Josef van Genabith and Andy Way.
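As a closing illustration of the SMT flow, here is a toy end-to-end sketch: a bigram language model estimated from a tiny, invented monolingual corpus; a hand-set translation table standing in for probabilities that would be learnt from a parallel corpus; and a brute-force decoder searching for argmax_e p(e) p(f|e). This is not real SMT system code, and the candidate list would in practice be built by the decoder itself; the sketch only mirrors the structure of the pipeline.

```python
from collections import Counter

# Language model p(e): bigram maximum-likelihood estimates from a
# tiny (invented) monolingual target-language corpus.
corpus = ["the cat sleeps", "the cat enters the bedroom"]
bigrams, contexts = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split()
    for w1, w2 in zip(words, words[1:]):
        bigrams[(w1, w2)] += 1
        contexts[w1] += 1

def p_lm(e):
    # p(e) ≈ product of p(w_i | w_{i-1}); zero for unseen bigrams
    # (no smoothing in this toy model).
    words = ["<s>"] + e.split()
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= (bigrams[(w1, w2)] / contexts[w1]) if contexts[w1] else 0.0
    return prob

# Translation model p(f|e): hand-set values standing in for
# probabilities learnt from a parallel corpus.
p_tm = {("le chat", "the cat"): 0.7,
        ("le chat", "cat the"): 0.7,
        ("le chat", "the dog"): 0.1}

# Decoder: search the candidates for the e maximising p(e) * p(f|e).
def decode(f, candidates):
    return max(candidates, key=lambda e: p_lm(e) * p_tm.get((f, e), 0.0))

print(decode("le chat", ["the cat", "cat the", "the dog"]))  # the cat
```

Note how the two components divide the work exactly as on the slides: the language model rejects the disfluent "cat the" even though its translation probability is high, and the translation model rejects the fluent but inadequate "the dog".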