ECOE.554 Homework 4: Unsupervised Learning
Due: Jan 23, 2011
• This homework consists of a written section and a programming section.
• Please submit the written section answers in hard-copy to Eng228 and submit the
programming section answers in soft-copy to [email protected].
• Late submission of the project will not be accepted.
1 Written Section
1.1 Final Report
Please return a written report of your answers to the Fun Time part of the programming section (Section 2.4).
1.2 EM algorithm
Suppose we wish to use the EM algorithm to maximize the posterior distribution over parameters p(θ|X) for a model containing latent variables, where X is the observed data set. Show that the E step remains the same as in the maximum likelihood case, whereas in the M step, instead of maximizing Q(θ, θ^old), we maximize the quantity with the log prior added, Q(θ, θ^old) + ln p(θ), where Q(θ, θ^old) is defined by

    Q(θ, θ^old) = Σ_Z p(Z|X, θ^old) ln p(X, Z|θ).

You can find the definitions of Q and θ in PRML Chapter 9.
1.3 Regular–EM versus Viterbi–EM
We define a new EM algorithm by replacing the E step of the regular EM with Viterbi decoding. Both the regular-EM and Viterbi-EM algorithms maximize some objective function.
1. Write down the mathematical formulation of the Viterbi-EM.
2. Are regular-EM and Viterbi-EM optimizing the same objective? If your answer is no, explain the objective function of each algorithm.
3. Do these algorithms converge to the same θ?
1.4 Dirichlet Distribution and Gibbs Sampling
In the programming section you are going to implement a Gibbs sampler for an HMM model. The model and its parameters are defined as

    θ_t | α_t ∼ Dir(α_t)
    φ_t | α_w ∼ Dir(α_w)
    t_i | t_{i-1} = t ∼ Multi(θ_t)
    w_i | t_i = t ∼ Multi(φ_t)                (1)
Equation 1 describes the distributions of the first order HMM model [1]. In order to do Gibbs sampling without sampling the multinomial parameters, you need to integrate them out. Using the model defined in Equation 1, show that

    P(t_i | t_{-i}, α_t) = (n_{(t_{i-1}, t_i)} + α_t) / (n_{t_{i-1}} + T α_t)

    P(w_i | t_i, w_{-i}, α_w) = (n_{(t_i, w_i)} + α_w) / (n_{t_i} + W α_w)                (2)

where n_{(t_{i-1}, t_i)} and n_{(t_i, w_i)} are the numbers of occurrences of the tag bi-gram (t_{i-1}, t_i) and the tag–word pair (t_i, w_i), and n_t is the uni-gram count of tag t. All counts are taken over t_{-i} and w_{-i}, i.e. they exclude position i. T and W are the sizes of the tag and word vocabularies, respectively.
2 Programming Section
In this part of the project you are going to implement an unsupervised part-of-speech tagger in Java. You can ask Java-related questions to the TA at Eng228. In order to guide you and make the grading easier, template class files and interfaces, together with the data files and reference papers, are available in the homework package. Please do not change the class, method and field names; however, you can add your own methods and fields if necessary. Any modification of these names might break the automatic grading procedure.
2.1 Introduction
The unsupervised part-of-speech (POS) tagging problem can be defined as predicting the correct part-of-speech tag of an ambiguous word in a given context using an unlabeled corpus and a dictionary of possible word–tag pairs. For example, consider the tagged sentence in Table 1.

He|PRP was|VBD previously|RB vice|NN president|NN .|.

Table 1: An example sentence from the homework data. The correct POS tag of each word is delimited with a vertical bar. A more detailed definition of the tags can be found at http://bulba.sdsu.edu/jeanette/thesis/PennTags.html

We focus our attention on the Hidden Markov Model (HMM) since it is the simplest model-based approach and has a traditional place in the literature of unsupervised POS tagging. In a standard first order HMM each hidden state is conditioned on its preceding hidden state and each observation is conditioned on its corresponding hidden state. In POS tagging, the observed variable sequence is a sentence s_0 and the hidden variables t_i are the POS tags of the words w_i in s_0, as illustrated in Figure 1.

Figure 1: Graphical structure of the traditional first order HMM model for POS tagging.
2.2 Data Files
There are two data files in this homework:
1. wsj.te24000.tok is the data file, consisting of 1020 sentences. Each line of this data file is a tokenized English sentence from the Wall Street Journal corpus, and the correct tag of each word is delimited with a vertical bar.
2. wsj.dic is the dictionary file that keeps the possible tags for each word in the data file. Each line of the dictionary is a word followed by the possible tags of that word. To be consistent with the POS literature, the tag dictionary is constructed by listing all observed tags for each word in the entire Penn Treebank, which is a 1M-word POS-tagged corpus.
2.3 Classes
You will implement the EMBiHmm, Gibbs and Viterbi classes. For each class you have to implement its interface; otherwise you will lose points. As long as you implement the interface, you can add any necessary functions that you want to use in any class.
We provide DataReader, Matrix and Probability classes to ease your work during this
homework. Please read the Appendix to understand the details of these classes.
2.3.1 HMM with Expectation Maximization
The HMM model parameter θ can be estimated by running the Baum-Welch EM algorithm on an unlabeled training corpus D. The model parameter is defined as θ = {π_i, a_ij, b_ik}, where π_i is the probability of observing the i-th state as the initial hidden state, a_ij is the transition probability to state j given that the previous state is i, and b_ik is the probability of word k given that the state is i.
The tag sequence that maximizes Pr(t|s, θ̂) can be identified by the Viterbi search algorithm.
HMM-EM is a two-phase algorithm: (1) it estimates θ using Expectation Maximization (EM) and (2) it identifies the most probable hidden value sequence of a given sentence by applying the Viterbi search algorithm. For the EM and the Viterbi decoding, please read the article [2].
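To make the two phases concrete, the following sketch shows one possible end-to-end flow. The constructor arguments and the score()/getAccuracy() accessors are assumptions based on the descriptions below; the real signatures are fixed by the interfaces and template files in the homework package.

    // Hypothetical end-to-end flow: estimate the parameters with EM, then decode.
    public class RunEmTagger {
        public static void main(String[] args) {
            // File names come from Section 2.2; the constructor signature is assumed.
            EMBiHmm hmm = new EMBiHmm("data/wsj.te24000.tok", "data/wsj.dic");
            hmm.initUniformPi();
            hmm.initUniformS2S();
            hmm.initDicUniformW2S();   // dictionary-constrained emission initialization
            hmm.trainEm();             // phase 1: Baum-Welch parameter estimation

            Viterbi decoder = new Viterbi(hmm);
            decoder.score();           // phase 2: Viterbi-decode all observations (assumed name)
            System.out.println("accuracy = " + decoder.getAccuracy());
        }
    }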
2.3.2 EMBiHmm
You are going to define an EM class that estimates the parameters of a first order HMM. The fields of the class are defined for you; however, feel free to add any extra fields as long as you do not remove any predefined ones. The data field keeps the DataReader object. You can find the definition of this class in the related reading. Please read Appendix A before coding this class in order to understand the integer mapping and the other details of the DataReader class, which is ready to use.
The π vector is kept in pi; the transition log-probabilities from the previous state to the current state, a_ij, are kept in the s2s matrix, where each row represents the previous state. Finally, the log-probability of observing a word given a tag, b_iw, is kept in the w2s matrix, where each row represents the current state.
The states and words fields store the number of tags and words, respectively. The obs field keeps the integer-mapped training sentences, and iter keeps the maximum number of iterations that trainEm can perform; by default it is set to 20.
The totalLogTrain field keeps the sum of the log-likelihoods of each observation calculated in the last iteration of the EM training.
Instead of using log(0) directly, you can use the logzero field, which is a very small number but not minus infinity.
Constructor
The constructor of EMBiHmm first calls the DataReader constructor to read the data, then sets the number of states and words to be used during the memory allocation of the matrices. Finally, it uniformly initializes the matrices by default. You are required to code the methods that are declared in the EMBiHmm interface.
Likelihood?
In order to calculate the probability that the model assigns to a sentence, you will implement the likelihood method, which takes an integer-mapped word array and returns its log-probability.
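Since likelihood receives only the word array, one natural reading is that it returns the marginal log-probability of the sentence, obtained by log-summing the final forward messages over all states. A minimal sketch under that interpretation (the signature and the alpha[position][state] indexing are assumptions):

    // Sketch: log P(w_0 ... w_{n-1}) as the log-sum over states of the last
    // forward message, assuming forwardProbs returns messages in log scale.
    public double likelihood(int[] obs) {
        double[][] alpha = forwardProbs(obs);
        double logProb = Probability.logzero;
        int last = obs.length - 1;
        for (int state = 0; state < states; state++) {
            logProb = Probability.logSum(logProb, alpha[last][state]);
        }
        return logProb;
    }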
Initialization of Matrices
The methods under this heading initialize the log-probability matrices of the model. initUniformPi uniformly initializes pi. Similarly, initUniformS2S uniformly initializes the s2s matrix. The initDicUniformW2S method initializes the w2s matrix using the dictionary: only the entries observed in the dictionary will be initialized, and the rest will be set to log(0). The probability is shared uniformly among the observed entries. This method reduces the model complexity by reducing the possible tags of each word. Hint: keep in mind that the start and end tags are added by us, therefore none of the original words can take these tags, so the corresponding entries of the w2s matrix can be set to log(0).
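One way to read "shared uniformly among the observed entries" is per state row, so that each row of w2s remains a distribution over the words that may take that tag. A minimal sketch under that assumption (the data.getVoc() accessor name and the int[][] dictionary layout are assumptions):

    // Sketch of dictionary-constrained uniform initialization of w2s, assuming
    // voc[word] lists the integer-mapped tags allowed for that word and that
    // w2s[state][word] holds log P(word | state).
    public void initDicUniformW2S() {
        int[][] voc = data.getVoc();              // assumed accessor name
        int[] wordsPerState = new int[states];    // how many words may take each tag
        for (int w = 0; w < words; w++)
            for (int t : voc[w])
                wordsPerState[t]++;
        for (int s = 0; s < states; s++)
            for (int w = 0; w < words; w++)
                w2s[s][w] = Probability.logzero;  // entries not in the dictionary stay at "log(0)"
        for (int w = 0; w < words; w++)
            for (int t : voc[w])
                w2s[t][w] = Math.log(1.0 / wordsPerState[t]);
    }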
Now it is time for cheating. The initGoldW2S, initGoldS2S and initGoldPi methods initialize the corresponding matrices with the correct probabilities estimated from the labels of the training data. The estimation procedure simply counts the occurrences of tag bi-grams and word–tag pairs, then normalizes these counts and uses them as the gold probabilities (for example, s2s[i][j] = log(count(i, j)/count(i))). Since we add a start–tag to each sentence, we know that the initial tag is the same for all sentences, thus pi[0] = log(1) and the rest of the entries are equal to Probability.logzero.
Forward Probabilities
The forwardProbs method takes an observation array, which is nothing but an integer-mapped sentence, and returns a 2D array that keeps the forward messages in log scale. Please keep in mind that you should use the Probability.logSum(a, b) function to calculate the sum of logarithms.
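A minimal sketch of the forward recursion in log scale; the alpha[position][state] indexing and the assumption that pi, s2s and w2s already hold log-probabilities follow the field descriptions above.

    // alpha[i][s] = log P(w_0 .. w_i, t_i = s)
    public double[][] forwardProbs(int[] obs) {
        double[][] alpha = new double[obs.length][states];
        for (int s = 0; s < states; s++)
            alpha[0][s] = pi[s] + w2s[s][obs[0]];
        for (int i = 1; i < obs.length; i++) {
            for (int s = 0; s < states; s++) {
                double sum = Probability.logzero;
                for (int prev = 0; prev < states; prev++)
                    sum = Probability.logSum(sum, alpha[i - 1][prev] + s2s[prev][s]);
                alpha[i][s] = sum + w2s[s][obs[i]];
            }
        }
        return alpha;
    }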
Backward Probabilities
The backwardProbs method takes an observation array, which is nothing but an integer-mapped sentence, and returns a 2D array that keeps the backward messages in log scale. It mirrors forwardProbs, but the messages are propagated from the end of the sentence toward the beginning.
Training EM
This method is the main part of the homework. Before doing this part, please make sure that the backward and forward probabilities are calculated correctly. The trainEm method iterates over the observations and optimizes the pi, w2s and s2s matrices.
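As a sketch of the E step, each iteration has to turn the forward and backward messages into expected counts; trainEm would accumulate these over all observations and then renormalize them row-wise into new pi, s2s and w2s log-probabilities (the M step). The helper below follows Rabiner [2]; its name, signature and the use of plain double count arrays are assumptions.

    // Accumulate expected counts gamma_i(s) and xi_i(s, s') for one sentence.
    private void accumulateExpectedCounts(int[] sentence, double[] piCount,
                                          double[][] s2sCount, double[][] w2sCount) {
        double[][] alpha = forwardProbs(sentence);
        double[][] beta  = backwardProbs(sentence);
        double logProb   = likelihood(sentence);   // log P(sentence | current parameters)
        int n = sentence.length;
        for (int i = 0; i < n; i++) {
            for (int s = 0; s < states; s++) {
                // gamma_i(s) = P(t_i = s | sentence)
                double gamma = Math.exp(alpha[i][s] + beta[i][s] - logProb);
                w2sCount[s][sentence[i]] += gamma;
                if (i == 0) piCount[s] += gamma;
                if (i + 1 < n) {
                    for (int next = 0; next < states; next++) {
                        // xi_i(s, next) = P(t_i = s, t_{i+1} = next | sentence)
                        double xi = Math.exp(alpha[i][s] + s2s[s][next]
                                + w2s[next][sentence[i + 1]] + beta[i + 1][next] - logProb);
                        s2sCount[s][next] += xi;
                    }
                }
            }
        }
    }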
2.3.3 HMM with Gibbs Sampling
The HMM model parameters can be integrated out by using the Gibbs sampling algorithm on an unlabeled training corpus D. In this homework you will use the framework defined in the Gibbs Sampling section of [1], which is the same as the model defined in Equation 1 except for the parameter names. Johnson uses x and y to refer to words and states, respectively, while we use w and t. Similarly, α_x and α_y are the hyper-parameters of the model in [1], whereas α_w and α_t are the corresponding hyper-parameters of our model.
The Gibbs sampler that you are going to implement will sample from the conditional distribution of state t_i, P(t_i | w, t_{-i}, α), where t_{-i} is the set of states at all positions except i and α = {α_w, α_t}.
2.3.4 Gibbs
In this part of the homework you are going to implement a Gibbs sampler that estimates the desired parameters and probabilities of a first order HMM. The fields of the class are defined for you; however, feel free to add any extra fields as long as you do not remove any predefined ones. Similar to EMBiHmm, the data field keeps the DataReader object. You can find the definition of these procedures in Appendix A.
The pi, s2s, w2s, states, words and obs fields are common with the EMBiHmm class and their definitions can be found in Section 2.3.2. The iter field is the number of Gibbs sampling passes over all observations.
The sampleCounter field keeps the number of samplings performed so far; therefore it must be initialized to 0 in initDicUniformTags and must be incremented whenever the Probability.sampleMultinomial method is called in sampleGibbs. During the whole sampling process the algorithm collects samples at predetermined sampleCounter values and uses the average of these multiple samples to estimate the model parameters. The earliest samples are correlated with the initialization of the word–tag pairs; therefore some predefined number of initial samples are ignored, and this period is called the burn-in period. The algorithm does not collect any sample until the burn-in period is over, and the number of samplings in the burn-in period is defined by the burnin field. After the burnin samplings, the algorithm starts to collect a sample at every k-th step, where the value of k is stored in the sampleStep field. The sampleTaken field keeps the number of collected samples after the burn-in period. To make things clear, let's assume that the burn-in period is 1000 samplings (burnin) and we are going to collect a sample every 100 samplings (sampleStep). If the algorithm samples 3000 times, then the number of collected samples will be 20 (sampleTaken).
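As a concrete reading of this paragraph, a collection check along the following lines would yield exactly 20 collected samples in the example above (the exact condition is an assumption):

    // Inside sampleGibbs, after each call to Probability.sampleMultinomial:
    if (sampleCounter > burnin && (sampleCounter - burnin) % sampleStep == 0) {
        getSample();       // collect the current state of the sampler
        sampleTaken++;
    }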
The average of these samples is used to estimate the s2s and w2s matrices. Each estimate of these matrices is added to integrateS2S and integrateW2S, respectively. After collecting the last sample, these integration matrices are normalized in order to average the estimates.
The cs2s, cw2s and cs fields keep the counts of tag bi-grams, word–tag pairs and tag occurrences over all observations. These fields are set to their initial values by the countOccurance method after the observation tags have been randomly initialized according to the dictionary using the initDicUniformTags function. The gibbsTags field keeps the randomly initialized tags of each observation.
The alphaw and alphat fields keep the values of the hyper-parameters α_w and α_t.
Constructor
The constructor initializes all the fields and then randomly assigns tags by using initDicUniformTags. Finally, it calls countOccurance to set the initial count fields introduced
above.
Initialize the tags
The initDicUniformTags method randomly assigns to each word of a given observation one of its possible tags and writes these random tags to gibbsTags. The possible set of tags is determined by the dictionary file.
Counting
The countOccurance method counts the tag bi-grams, word–tag pairs and occurrences of each tag in gibbsTags and writes the counts to cs2s, cw2s and cs, respectively.
Gibbs Sampling
The sampleGibbs method is the core of the class: it samples over all observations iter times. This function updates the count fields of the class according to the sampling process and also collects samples at the desired iterations by calling the getSample function.
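A sketch of the central step, resampling a single tag, is below. It assumes obs and gibbsTags are int[][] arrays indexed by sentence and position, that the counts have already been decremented for position i, and it uses a simplified conditional that ignores the small corrections needed when t_{i-1}, t and t_{i+1} coincide; see Johnson [1] for the exact form. Since the start and end tags are fixed, only interior positions need to be resampled, and for the dictionary-constrained tagger one would loop only over the tags allowed for the current word.

    // Resample the tag at position i of sentence s from its (approximate)
    // collapsed conditional, given counts that exclude position i.
    private int resampleTag(int s, int i) {
        int prev = gibbsTags[s][i - 1];
        int next = gibbsTags[s][i + 1];
        int word = obs[s][i];
        double[] p = new double[states];
        double total = 0.0;
        for (int t = 0; t < states; t++) {
            double trIn  = (cs2s[prev][t] + alphat) / (cs[prev] + states * alphat);
            double trOut = (cs2s[t][next] + alphat) / (cs[t] + states * alphat);
            double emit  = (cw2s[t][word] + alphaw) / (cs[t] + words * alphaw);
            p[t] = trIn * trOut * emit;
            total += p[t];
        }
        for (int t = 0; t < states; t++) p[t] /= total;  // sampleMultinomial expects probabilities
        sampleCounter++;                                  // incremented at every multinomial draw
        return Probability.sampleMultinomial(p);
    }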
Collecting Samples
The getSample method collects samples at predetermined steps. The steps are determined by the sampleStep and burnin fields, as described at the beginning of this section. The collected samples are added to the integration fields to be used during decoding.
Estimating Bi-gram Tag Probabilities
The estimateS2S method estimates a probability array by using the cs2s and cs count
arrays that are updated by sampleGibbs. It takes a 2D double array and fills this array
with the estimated probabilities.
Similarly, the estimateW2S method estimates a probability array by using the cw2s and
cs count arrays that are updated by sampleGibbs. It takes a 2D double array and fills
this array with the estimated probabilities.
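A sketch of what estimateS2S could look like; whether the estimate should include the α_t smoothing term (mirroring Equation 2) or be a raw relative frequency is an assumption.

    // Fill p[i][j] with an estimate of P(t_j | t_i) from the current bigram and
    // unigram tag counts, smoothed with alphat as in Equation 2.
    public void estimateS2S(double[][] p) {
        for (int i = 0; i < states; i++)
            for (int j = 0; j < states; j++)
                p[i][j] = (cs2s[i][j] + alphat) / (cs[i] + states * alphat);
    }
    // estimateW2S is analogous, using cw2s, cs, alphaw and the word vocabulary size.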
2.3.5 Viterbi
This class implements the Viterbi decoding algorithm, which finds the tag sequence of an observation with the highest log-likelihood given the pi, s2s and w2s probability arrays. The Viterbi decoder uses log probabilities.
The pi, s2s, w2s, states, words and obs fields are common with the EMBiHmm and Gibbs classes and their definitions can be found in Sections 2.3.2 and 2.3.4.
The key field keeps the correct tags of the observations that are read from the input data file, while the ans field keeps the tags that are assigned by the methods.
The accuracy field keeps the accuracy of the model and is set by the score functions. Similarly, loglikelihood keeps the total log-likelihood of the model answers and is initially set to Probability.logzero.
Constructor
There are two constructors for this class: the first one initializes the above parameters using EMBiHmm, while the second one uses Gibbs.
Viterbi Decoding
The viterbi method takes an observation array and returns the tag sequence with the highest log-likelihood given the probability matrices. This function also adds the log-likelihood of the current observation to the loglikelihood field.
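A minimal sketch of the recursion in log scale; the int[] return type, the delta[position][state] indexing, and the place where loglikelihood would be updated are assumptions.

    // delta[i][s] = best log score of any tag sequence ending in state s at
    // position i; psi stores the back-pointers used to recover that sequence.
    public int[] viterbi(int[] obs) {
        double[][] delta = new double[obs.length][states];
        int[][] psi = new int[obs.length][states];
        for (int s = 0; s < states; s++)
            delta[0][s] = pi[s] + w2s[s][obs[0]];
        for (int i = 1; i < obs.length; i++) {
            for (int s = 0; s < states; s++) {
                int arg = 0;
                double best = delta[i - 1][0] + s2s[0][s];
                for (int prev = 1; prev < states; prev++) {
                    double score = delta[i - 1][prev] + s2s[prev][s];
                    if (score > best) { best = score; arg = prev; }
                }
                delta[i][s] = best + w2s[s][obs[i]];
                psi[i][s] = arg;
            }
        }
        int last = obs.length - 1;
        int bestState = 0;
        for (int s = 1; s < states; s++)
            if (delta[last][s] > delta[last][bestState]) bestState = s;
        // the loglikelihood field would be updated here with delta[last][bestState]
        int[] tags = new int[obs.length];
        tags[last] = bestState;
        for (int i = last; i > 0; i--)
            tags[i - 1] = psi[i][tags[i]];
        return tags;
    }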
Viterbi on all observations
The viterbiAllObs method calls the viterbi function for every observation and it is given
to you.
Scoring
There are two scoring functions in this homework. The first one does not take any input and just runs Viterbi with the initialized parameters, while the second one takes a vector of answer tags and only compares this answer with the correct tags that are kept in the key field. Thus, if you call the first score function, the loglikelihood and accuracy fields will be updated; the second one, on the other hand, only updates the accuracy. Both functions are given.
2.4 Fun Time
Please perform the following steps for the EM model. For each case below, please report the log-likelihood of the model at every 200 observations, the initial and final model size, and the change of the log-likelihood. Also report the initial accuracy (without training with EM) and the final accuracy of each case by using Viterbi decoding. Please comment on your results. Bonus: Is there anything unexpected when you compare the initial and final accuracies? Hint: The model size can be determined by counting the non-zero entries of the probability matrices.
1. Do uniform initialization with the dictionary, train EM, and report the total log-likelihood.
2. Initialize s2s with gold and w2s with uniform probabilities.
3. Initialize w2s with gold and s2s with uniform probabilities.
4. Do gold initialization, train EM, and report the total log-likelihood.
Please perform the following steps for the Gibbs model.
1. Run Gibbs with these settings: alphat = [0.1, 0.01, 0.001, 0.0001] and alphaw = [1, 0.01, 0.001]. For each model, calculate the model accuracies by: (1) using the final sample to estimate the parameters of Viterbi, (2) using the average of the collected samples to estimate the parameters of Viterbi, and (3) using the tags that are assigned by the final sample (do not run Viterbi in this case). Thus you are going to report 4x3x3 = 36 results.
2. Instead of iterating over all observations 100 times, iterate 1000 times and calculate the accuracies in the same manner as in the previous step for the worst and best parameter settings of the previous step.
Appendix
A util.DataReader
This class represents the data reader object. You can find the definitions of the object
fields and methods in DataReader.java.
Constructor
The DataReader objects are constructed by calling the constructor of the DataReader class. The constructor takes two variables, (1) the training file name and (2) the vocabulary file name, and then initializes the fields of the class. The names of the training file and the vocabulary file are written to the fileName and vocName fields, respectively. In order to make the calculations easier, the constructor first processes each sentence by concatenating start and end tags, then it maps each word and tag to a unique integer. The data field stores the training data sentences mapped to the integer domain, while tags keeps the corresponding tags of each word mapped to the integer domain. The numbers of unique words and tags are stored in wc and tc, respectively.
The mappings from word to integer and from tag to integer are kept in the word2int and tag2int hashes. Similarly, int2word and int2tag keep the mappings from the integer domain to the word and tag domains, respectively. The mappings are dumped to the data.map file in the data folder of the homework, so you can see the integer mapping of each word.
Finally, the constructor reads the dictionary file into voc. The concatenation of the start and end tags does not increase the problem complexity since they have only one possible tag in the dictionary. voc is a 2D array in which the first dimension keeps the words while the second dimension keeps the possible tags of the corresponding word. For example, the integer mappings of the start and end tags are 0 and 1 and the integer mappings of their POS tags are also 0 and 1, thus voc[0][0] = 0 and voc[1][0] = 1.
Getters and Setters
All the fields have getter and setter methods, even though you are not going to edit the DataReader fields.
Preprocess of the sentences
Given a string, preProcess concatenates the start–tag (<S>) and the end–tag (</S>). The start and end tags enable the model to determine the start and end of a sentence.
toString
The toStringTags method returns the tags sorted according to their integer mappings. Similarly, toStringHashMap converts the target hash into a string where each line consists of a word and its integer mapping.
B util.Matrix
This class performs various common operations on matrices and arrays. It is defined statically, therefore you can reach its methods by using the class name directly (for example, Matrix.convertLog). The only field is the static variable logzero, which is a very small negative number but not minus infinity.
Number of non–zero entries?
The countNonZero method takes a 2D integer array, counts the number of non-zero entries and returns this count.
Converting the matrices
The convertLog and convertExp methods take the natural logarithm and the exponential of a given 2D double matrix, respectively.
Normalize matrices
The normalize method takes a 2D double array and normalizes it row-wise.
Comparing vectors
The compareVectors method compares two integer vectors element by element and returns the number of common elements.
C util.Probability
This class performs probability-related functions that are used in the other classes. It is a static class, therefore you can reach its methods by using the class name directly. There are four static fields defined in this class: (1) logzero is a very small negative number used instead of minus infinity, (2) epsilon is the maximum amount of numerical error, so please use epsilon while checking whether probabilities add up to 1 or not, (3) r is a Random object that generates the random numbers, and (4) seed is the initial seed value of r.
Summation of logarithms: Numerical problems
In order to prevent underflow and overflow, the summation of log-probabilities requires special attention. Therefore you can use the logSum method, which takes two log-probabilities and returns the log of their sum. Bonus: Explain why this function works.
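The usual way to implement such a function is the log-sum-exp trick; a minimal sketch is below (the provided implementation may differ, and the special-casing of the logzero sentinel is an assumption).

    // log(e^a + e^b), computed by factoring out the larger exponent so that the
    // call to Math.exp never overflows.
    public static double logSum(double a, double b) {
        if (a == logzero) return b;      // treat the "log(0)" sentinel specially
        if (b == logzero) return a;
        double max = Math.max(a, b);
        double min = Math.min(a, b);
        return max + Math.log(1.0 + Math.exp(min - max));
    }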
Is everything OK?
Since we are dealing with probabilities, it is necessary to check whether they add up to 1 or not. This can be checked with the probabilityCheck or logProbabilityCheck methods, which take a probability or a log-probability array, respectively, and check whether its entries sum to 1 or not.
Sample from Multinomial
The sampleMultinomial method takes a probability vector and samples a point from this vector according to the probabilities.
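A common way to implement such a sampler is to invert the cumulative distribution; a minimal sketch under that assumption (the provided implementation may differ):

    // Draw u uniformly in [0, total) and return the first index whose running
    // cumulative sum exceeds u; r is the class's Random field.
    public static int sampleMultinomial(double[] p) {
        double total = 0.0;
        for (double v : p) total += v;
        double u = r.nextDouble() * total;
        double cumulative = 0.0;
        for (int i = 0; i < p.length; i++) {
            cumulative += p[i];
            if (u < cumulative) return i;
        }
        return p.length - 1;             // guard against floating-point rounding
    }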
References
[1] M. Johnson. Why doesn't EM find good HMM POS-taggers? In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296–305, 2007.
[2] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.