CA446
Statistical Machine Translation
Week 2: Introduction to SMT
Lecturer: Qun Liu
Lab Tutor: Xiaofeng Wu, Iacer Calixto
2nd Semester, 2014-2015 Academic Year
http://computing.dcu.ie/~qliu/CA446
Content
Linguistics Basics
Probability Theory Basics
What is SMT?
Noisy Channel Framework for SMT
SMT Flow
Words
• Words are a basic unit of meaning.
• The process of breaking a sentence into words is known as tokenisation.
• Words can be grouped into different categories (parts-of-speech): noun, verb, adjective, adverb, determiner, pronoun, preposition, conjunction, possessive marker, interjection, punctuation, etc.
• We make a distinction between content and function words.
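As a minimal sketch, tokenisation can be approximated with a regular expression (a crude illustration only; real tokenisers must also handle abbreviations, clitics and hyphenation):

    import re

    def tokenise(sentence):
        # Keep runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", sentence)

    print(tokenise("The cat enters the bedroom."))
    # ['The', 'cat', 'enters', 'the', 'bedroom', '.']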
Morphemes
• Words can be decomposed into morphemes.
• We make a distinction between inflectional morphology and derivational morphology.
• Inflectional morphemes are used to mark grammatical information such as gender, number, person, tense and case.
• Derivational morphemes are used to change the part-of-speech of a word, e.g. the suffix -al changes the noun sensation into the adjective sensational.
Syntax
• Syntax is the study of sentence structure and the relationships between words.
• We can represent the structure of a sentence using various types of data structures, including graphs and trees.
• Two very common representations are:
  – Phrase Structure Trees
  – Dependency Trees (see the sketch below)
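As a small illustration (the sentence and labels are chosen for this example), a dependency tree can be stored as an array of head indices, one per word:

    # Dependency tree for "The cat enters the bedroom ."
    # heads[i] is the index of word i's head; the root points to -1.
    words  = ["The", "cat", "enters", "the", "bedroom", "."]
    heads  = [1, 2, -1, 4, 2, 2]
    labels = ["det", "subj", "root", "det", "obj", "punct"]

    for word, head, label in zip(words, heads, labels):
        governor = words[head] if head >= 0 else "ROOT"
        print(f"{word} --{label}--> {governor}")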
Phrase Structure versus Dependency
[Figure: a phrase structure tree and a dependency tree for the same sentence, with dependency relations such as subj and obj]
Discourse
• Discourse is concerned with written and spoken communication.
• It is the study of language above the sentence level.
• A major challenge for automatic discourse analysis is co-reference (anaphora) resolution.
  – The snow fell yesterday. This was the first time it had come in November.
  – In their free time, the children played Candy Crush.
Content
Linguistics Basics
Probability Theory Basics
What is SMT?
Noisy Channel Framework for SMT
SMT Flow
Random Variables
• A random variable is a quantity whose value is not fixed and which can take on different values.
• If X is a random variable describing the outcome of rolling a die, the possible values for X are 1, 2, 3, 4, 5 and 6.
Probability Distributions
• A probability distribution is a function which maps each value of a random variable to a probability, e.g.
• If X is a random variable describing the outcome of rolling a fair die:
  p(X=1) = 1/6    p(X=4) = 1/6
  p(X=2) = 1/6    p(X=5) = 1/6
  p(X=3) = 1/6    p(X=6) = 1/6
Probability Distributions
• Each probability is in the range 0…1:
  ∀x : 0 ≤ p(X=x) ≤ 1
• The sum of the probabilities of all possible values of the variable is 1:
  Σ_x p(X=x) = 1
Probability Distributions
• Two very common probability distributions are:
  – the uniform distribution
  – the normal distribution
• Probability distributions can be estimated from data samples; the larger the sample, the better the estimate.
Normal Distribution
[Figure: the bell-shaped density curve of a normal distribution]
Probability Distributions
• Say we suspect that the die is loaded…
• Throw the die k times and estimate the probability of each value:
  p(X=1) = #1/k    p(X=4) = #4/k
  p(X=2) = #2/k    p(X=5) = #5/k
  p(X=3) = #3/k    p(X=6) = #6/k
  where #n is the number of times n is thrown.
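A minimal sketch of this relative-frequency estimation (the throws below are invented for illustration):

    from collections import Counter

    throws = [3, 6, 6, 1, 6, 2, 6, 4, 6, 5, 6, 3]  # hypothetical data
    k = len(throws)
    counts = Counter(throws)

    # p(X=n) = #n / k
    for n in range(1, 7):
        print(f"p(X={n}) = {counts[n]}/{k} = {counts[n]/k:.2f}")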
Notation
• We can omit the name of the random variable and write p(x) instead of p(X=x).
Joint Probability Distributions
• Given two random variables X and Y, the probability of X taking on the value x and Y taking on the value y is the joint probability, written as p(x,y).
• If X and Y are independent, then p(x,y) = p(x)p(y).
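For instance (an assumed example), two independent fair coin flips give a joint distribution that factorises into the two marginals:

    # Two independent fair coins: p(x,y) = p(x) * p(y)
    p_x = {"H": 0.5, "T": 0.5}
    p_y = {"H": 0.5, "T": 0.5}

    p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
    print(p_xy)                # every outcome has probability 0.25
    print(sum(p_xy.values()))  # the joint probabilities sum to 1.0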
Conditional Probability Distributions
Given two random variables, X and Y, the probability of Y taking on the value y given that X has the value x is the conditional probability, written as:
  p(y|x) = p(x,y) / p(x)
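A small worked sketch (the joint distribution below is invented): conditioning renormalises the joint probabilities by the marginal p(x).

    # Hypothetical joint distribution over weather X and activity Y
    p_xy = {
        ("rain", "read"): 0.3, ("rain", "walk"): 0.1,
        ("sun",  "read"): 0.2, ("sun",  "walk"): 0.4,
    }
    p_x = {"rain": 0.4, "sun": 0.6}  # marginal: sum of p(x,y) over y

    # p(y|x) = p(x,y) / p(x)
    print(p_xy[("rain", "walk")] / p_x["rain"])  # p(walk|rain) = 0.25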
Bayes’ Rule
From the definition of conditional probability, we can derive Bayes’ Rule:
  p(x|y) = p(y|x) p(x) / p(y)
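Continuing the invented weather/activity example, Bayes’ Rule inverts the direction of conditioning:

    p_x = {"rain": 0.4, "sun": 0.6}   # p(x)
    p_y = {"read": 0.5, "walk": 0.5}  # p(y)
    p_walk_given_rain = 0.25          # p(y=walk | x=rain), from above

    # p(x|y) = p(y|x) * p(x) / p(y)
    p_rain_given_walk = p_walk_given_rain * p_x["rain"] / p_y["walk"]
    print(p_rain_given_walk)  # 0.2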
Content
Linguistics Basics
Probability Theory Basics
What is SMT?
Noisy Channel Framework for SMT
SMT Flow
Different Approaches to MT
Different approaches can be classified based on whether they are rule-based or data-driven:
• In a rule-based system, rules for translating from one language to another are handcrafted by linguists and knowledge engineers.
• In a data-driven system, rules are learnt automatically from large quantities of data.
What is SMT?
SMT systems are data-driven and:
• automatically learn how to translate text from large corpora of example translations (mostly produced by humans);
• learn how to translate during the training phase;
• assign likelihoods to translation alternatives.
The corpus of example translations is known as a bitext or parallel corpus.
Rosetta Stone
[Figure: photograph of the Rosetta Stone, an early example of a parallel text]
Parallel Corpus
[Figures: examples of sentence-aligned parallel corpora, shown over three slides]
SMT: The Basic Idea
[Figure: translation posed as a word puzzle over parallel sentences]
Aligning Words: An Example
[Figures: a word-alignment example between parallel sentences, developed step by step over ten slides]
SMT: The Basic Idea
• The idea of SMT is to simulate the above process of solving the word puzzle.
• First we define a probabilistic model of the translation process.
• Then we use this probabilistic model to find the most likely target translation for a given source sentence.
Probabilistic Model for Translation
• Given any English sentence e and any foreign sentence f, we define the probability that e is a translation of f:
  p(e|f)
where the normalization condition is:
  Σ_e p(e|f) = 1
Translation as Searching
• Once we have a probabilistic translation model, the translation problem can be regarded as the problem of searching for the English sentence e with the highest probability given the input foreign sentence f:
  ê = argmax_e p(e|f)
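As a toy sketch (the candidates and scores are invented; a real decoder searches an enormous space of possible sentences rather than a fixed list):

    # Hypothetical candidate translations of an input sentence f,
    # each scored by an assumed model p(e|f).
    candidates = {
        "The cat enters the bedroom.":    0.5,
        "The cat enters in the bedroom.": 0.3,
        "My Granny plays the piano.":     0.2,
    }

    # e_hat = argmax_e p(e|f)
    e_hat = max(candidates, key=candidates.get)
    print(e_hat)  # "The cat enters the bedroom."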
Content
Linguistics Basics
Probability Theory Basics
What is SMT?
Noisy Channel Framework for SMT
SMT Flow
Noisy Channel Framework
• The Noisy Channel Framework (Model) was proposed by IBM researchers Peter F. Brown et al. in 1990:
  Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, Paul S. Roossin. A Statistical Approach to Machine Translation. Computational Linguistics, 1990.
Noisy Channel Framework
[Figure: the noisy channel, built up over three slides. An English sentence e is generated with probability p(e) and passed through a noisy channel, which emits the French sentence f with probability p(f|e).]
  p(e,f) = p(e) p(f|e)
Noisy Channel Framework
Applying Bayes’ Rule, we have:
  p(e|f) = p(e) p(f|e) / p(f)
Thus, since p(f) is fixed for a given input f:
  ê = argmax_e p(e|f) = argmax_e p(e) p(f|e)
SMT Components
  ê = argmax_e p(e) p(f|e)
• p(e): the Language Model
• p(f|e): the Translation Model
• argmax_e: the Decoder
Noisy Channel Framework
• The translation model models how likely it is that f is a translation of e – adequacy.
• The language model models how likely it is that e is an acceptable sentence – fluency.
• The decoder searches for the most likely e.
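A toy sketch of how the two models combine in the decoder (the candidate sentences are those from the next slide; all probabilities are invented for illustration):

    # Hypothetical scores for translating
    # f = "Le chat entre dans la chambre."
    p_lm = {  # language model p(e): fluency
        "The cat enters the bedroom.":    0.010,
        "The cat enters in the bedroom.": 0.002,
        "My Granny plays the piano.":     0.008,
    }
    p_tm = {  # translation model p(f|e): adequacy
        "The cat enters the bedroom.":    0.7,
        "The cat enters in the bedroom.": 0.8,
        "My Granny plays the piano.":     0.0001,
    }

    # Decoder: e_hat = argmax_e p(e) * p(f|e)
    e_hat = max(p_lm, key=lambda e: p_lm[e] * p_tm[e])
    print(e_hat)  # "The cat enters the bedroom."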
Fluency versus Adequacy
Source Sentence:
Le chat entre dans la chambre.
• Adequate fluent translation:
The cat enters the bedroom.
• Adequate disfluent translation:
The cat enters in the bedroom.
• Fluent inadequate translation:
My Granny plays the piano.
• Disfluent inadequate translation:
piano Granny the plays My
Content
Linguistics Basics
Probability Theory Basics
What is SMT?
Noisy Channel Framework for SMT
SMT Flow
Model Training
How can we obtain a translation model and a language model?
• A probabilistic model can be trained using data samples.
• For the translation model, the data samples are a bilingual (parallel) text corpus.
• For the language model, the data samples are a monolingual text corpus.
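A minimal sketch of language model training (a bigram model estimated by relative frequency from a tiny invented corpus; real language models use far larger corpora and smoothing):

    from collections import Counter

    corpus = [
        "the cat enters the bedroom",
        "the cat plays the piano",
    ]

    bigrams, unigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))

    # p(w2|w1) = count(w1 w2) / count(w1)
    def p(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1]

    print(p("the", "cat"))  # 0.5: "the" is followed by "cat" in 2 of 4 cases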
SMT Flow
[Diagram: parallel text data feeds translation model training, producing the translation model; monolingual text data in the target language feeds language model training, producing the language model; decoding uses both models to turn the source text into the target text.]
Discussion
Acknowledgement
Parts of the content of this lecture are taken from previous lectures and presentations given by Jennifer Foster, Declan Groves, Yvette Graham, Kevin Knight, Josef van Genabith and Andy Way.