ECOE.554 Homework 4: Unsupervised Learning
Due: Jan 23, 2011

• This homework consists of a written section and a programming section.
• Please submit the written section answers in hard copy to Eng228 and submit the programming section answers in soft copy to [email protected].
• Late submission of the project will not be accepted.

1 Written Section

1.1 Final Report

Please return a written report of your answers to the fun part (Section 2.4) of the programming section.

1.2 EM algorithm

Suppose we wish to use the EM algorithm to maximize the posterior distribution over parameters p(θ|X) for a model containing latent variables, where X is the observed data set. Show that the E step remains the same as in the maximum-likelihood case, whereas in the M step, instead of maximizing Q(θ, θ^old), we maximize the prior-augmented quantity Q(θ, θ^old) + ln p(θ), where Q(θ, θ^old) is defined by

    Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta).

You can find the definitions of Q and θ in PRML, Chapter 9.

1.3 Regular-EM versus Viterbi-EM

We define a new EM algorithm by replacing the E step of regular EM with Viterbi decoding. Both the regular EM and the Viterbi-EM algorithms maximize some objective function.

1. Write down the mathematical formulation of Viterbi-EM.
2. Are regular-EM and Viterbi-EM optimizing the same thing? If your answer is no, explain the objective function of each algorithm.
3. Do these algorithms converge to the same θ?

1.4 Dirichlet Distribution and Gibbs Sampling

In the programming section you are going to implement a Gibbs sampler for an HMM model. The model and its parameters are defined as

    \theta_t \mid \alpha_t \sim \mathrm{Dir}(\alpha_t)
    \phi_t   \mid \alpha_w \sim \mathrm{Dir}(\alpha_w)
    t_i \mid t_{i-1} = t   \sim \mathrm{Multi}(\theta_t)
    w_i \mid t_i = t       \sim \mathrm{Multi}(\phi_t)                                  (1)

Equation 1 describes the distributions of the first-order HMM model [1]. In order to do Gibbs sampling without drawing samples of the multinomial parameters, you need to integrate out the multinomial parameters. Using the model defined in Equation 1, show that

    P(t_i \mid t_{-i}, \alpha_t) = \frac{n_{(t_i, t_{i-1})} + \alpha_t}{n_{(t_{i-1})} + T\alpha_t}
    P(w_i \mid w_{-i}, \alpha_w) = \frac{n_{(t_i, w_i)} + \alpha_w}{n_{(t_i)} + W\alpha_w}          (2)

where n_{(t_i, t_{i-1})}, n_{(t_i)} and n_{(t_i, w_i)} are the numbers of occurrences of the bi-gram (t_{i-1}, t_i), the uni-gram (t_i) and the tag-word pair (t_i, w_i), respectively. t_{-i} and w_{-i} denote the tags and words excluding t_i and w_i. T and W are the sizes of the tag and word vocabularies, respectively.
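Hint (an editorial sketch, not part of the required answer): the result follows from the standard posterior-predictive form of a multinomial under a symmetric Dirichlet prior; integrating out the multinomial parameter turns probabilities into smoothed relative counts. In the sketch below, n_k denotes the relevant count collected from all positions except i and T is the number of categories.

    % Posterior predictive under a symmetric Dirichlet prior (sketch):
    P(t_i = k \mid t_{-i}, \alpha)
      = \int P(t_i = k \mid \theta)\, p(\theta \mid t_{-i}, \alpha)\, d\theta
      = \frac{n_k + \alpha}{\sum_{k'=1}^{T} n_{k'} + T\alpha}

Applying the same identity to the emission parameters φ_t gives the second line of Equation 2.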
2 Programming Section

In this part of the project you are going to implement an unsupervised part-of-speech tagger in Java. You can ask Java-related questions to the TA at Eng228. In order to guide you and to make the grading easier, template class files and interfaces, together with the data files and reference papers, are available in the homework package. Please do not change the class, method and field names; however, you can add your own methods and fields if necessary. Any modification of these names might fail the automatic grading procedure.

2.1 Introduction

The unsupervised part-of-speech (POS) tagging problem can be defined as predicting the correct part-of-speech tag of an ambiguous word in a given context using an unlabeled corpus and a dictionary of possible word-tag pairs. Table 1 gives an example sentence from the homework data.

    He|PRP was|VBD previously|RB vice|NN president|NN .|.

Table 1: An example sentence from the homework data. The correct POS of each word is delimited with a vertical bar. A more detailed definition of the tags can be found at http://bulba.sdsu.edu/jeanette/thesis/PennTags.html.

We focus our attention on the Hidden Markov Model (HMM) since it is the simplest model-based approach and has a traditional place in the literature of unsupervised POS tagging. In a standard first-order HMM each hidden state is conditioned on its preceding hidden state and each observation is conditioned on its corresponding hidden state. In POS tagging, the observed variable sequence is a sentence s and the hidden variables t_i are the POS tags of the words w_i in s, as illustrated in Figure 1.

Figure 1: Graphical structure of the traditional first-order HMM model for POS tagging.

2.2 Data Files

There are two data files in this homework:

1. wsj.te24000.tok is the data file consisting of 1020 sentences. Each line of this data file is a tokenized English sentence from the Wall Street Journal corpus, and the correct tag of each word is delimited with a vertical bar.
2. wsj.dic is the dictionary file that keeps the possible tags for each word in the data file. Each line of the dictionary is a word followed by the possible tags of that word. To be consistent with the POS literature, the tag dictionary is constructed by listing all observed tags for each word in the entire Penn Treebank, which is a 1M-word POS-tagged corpus.

2.3 Classes

You will implement the EMBiHmm, Gibbs and Viterbi classes. For each class you have to implement its interface; otherwise you will lose grades. As long as you implement the interface, you can add any necessary functions that you want to use in any class.

We provide the DataReader, Matrix and Probability classes to ease your work during this homework. Please read the Appendix to understand the details of these classes.

2.3.1 HMM with Expectation Maximization

The HMM model parameter θ can be estimated by using the Baum-Welch EM algorithm on an unlabeled training corpus D. The model parameter is defined as θ = {π_i, a_ij, b_iw}, where π_i is the probability of observing the i-th state as the initial hidden state, a_ij is the transition probability to state j given that the previous state is i, and b_iw is the probability of word w given that the state is i. The tag sequence that maximizes Pr(t|s, θ̂) can be identified by the Viterbi search algorithm. HMM-EM is a two-phase algorithm: (1) it estimates θ using Expectation Maximization (EM) and (2) it identifies the most probable hidden value sequence of a given sentence by applying the Viterbi search algorithm. For the EM and the Viterbi decoding, please read the article [2]. The joint probability factorization used by both phases is sketched below.
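For reference, the following editorial sketch writes out the joint probability that these parameters define; it is only a restatement of the model above in our notation, with the start/end tag handling omitted.

    % Joint probability of a tag sequence t_1..t_n and sentence w_1..w_n
    % under a first-order HMM with \theta = \{\pi_i, a_{ij}, b_{iw}\}:
    P(t, w \mid \theta) = \pi_{t_1}\, b_{t_1 w_1} \prod_{i=2}^{n} a_{t_{i-1} t_i}\, b_{t_i w_i}

Since this is a product of many small probabilities, the implementation works with its logarithm; this is why the pi, s2s and w2s matrices described in the next section store log-probabilities.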
2.3.2 EMBiHmm

You are going to define an EM class that estimates the parameters of a first-order HMM. The fields of the class are defined for you; however, feel free to add any extra fields as long as you do not remove any predefined ones. The data field keeps the DataReader object. You can find the definition of this class in the related reading. Please read Appendix A before coding this class in order to understand the integer mapping and the other details of the DataReader class, which is ready to use.

The π vector is kept in pi. The transition log-probabilities between the current and the previous state, a_ij, are kept in the s2s matrix, where each row represents the previous state. Finally, the log-probability of observing a word with a given tag, b_iw, is kept in the w2s matrix, where each row represents the current state. The states and words fields store the number of tags and words, respectively. The obs field keeps the integer-mapped training sentences, and iter keeps the maximum number of iterations that trainEM can perform; by default it is set to 20. The totalLogTrain field keeps the summation of the log-likelihoods of each observation calculated in the last iteration of the EM training. Instead of using log(0) directly you can use the logzero field, which is a very small number but not infinity.

Constructor
The constructor of EMBiHmm first calls the DataReader constructor to read the data, then sets the number of states and words in order to use them during the memory allocation of the matrices. Finally, it uniformly initializes the matrices by default. You are required to code the methods that are represented in the EMBiHmm interface.

Likelihood?
In order to calculate the probability of a tag sequence of a sentence, you will implement the likelihood method, which takes an integer-mapped word array and returns its log-probability.

Initialization of Matrices
The methods under this title initialize the log-probability matrices of the model. initUniformPi uniformly initializes pi. Similarly, initUniformS2S uniformly initializes the s2s matrix. The initDicUniformW2S method initializes the w2s matrix using the dictionary: only the entries observed in the dictionary will be initialized and the rest will be set to log(0). The probability is shared uniformly among the observed entries. This method reduces the model complexity by reducing the possible tags for each word. Hint: Keep in mind that the start and end tags are added by us, therefore none of the words can have their tags; thus the corresponding entries of the w2s matrix can be set to log(0).

Now it is time for cheating. The initGoldW2S, initGoldS2S and initGoldPi methods initialize the corresponding matrices with the correct probabilities that are estimated from the labels of the training data. The estimation procedure simply counts the occurrences of bi-gram tag pairs and word-tag pairs, then normalizes these counts and uses them as the gold probabilities (for example, s2s[i][j] = log(count(i, j)/count(i))). Since we add a start tag to each sentence, we know that the initial tag is the same for all sentences; thus pi[0] = log(1) and the rest of the entries are equal to Probability.logzero.

Forward Probabilities
The forwardProbs method takes an observation array, which is nothing but an integer-mapped sentence, and returns a 2D array that keeps the forward messages in log scale. Please keep in mind that you should use the Probability.logSum(a, b) function to calculate the sum of logarithms.

Backward Probabilities
The backwardProbs method takes an observation array, which is nothing but an integer-mapped sentence, and returns a 2D array that keeps the backward messages in log scale.

Training EM
This method is the main part of the homework. Before doing this part, please make sure that the backward and forward probabilities are calculated correctly. The trainEM method iterates over the observations and optimizes the pi, w2s and s2s matrices. A sketch of the forward recursion is given below.
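The following is an editorial sketch of a log-space forward recursion, shown only to illustrate the message-passing pattern. The free-standing signature, the class name ForwardSketch and the stand-in logSum helper are assumptions of this sketch; your implementation should use the EMBiHmm fields and the provided Probability.logSum instead.

    // Editorial sketch: log-space forward messages for a first-order HMM.
    // pi[i], s2s[i][j] and w2s[i][w] hold log-probabilities, as in EMBiHmm.
    public class ForwardSketch {

        // Stand-in for Probability.logSum: log(exp(a) + exp(b)) computed stably.
        static double logSum(double a, double b) {
            if (a == Double.NEGATIVE_INFINITY) return b;
            if (b == Double.NEGATIVE_INFINITY) return a;
            double m = Math.max(a, b);
            return m + Math.log(Math.exp(a - m) + Math.exp(b - m));
        }

        // Returns alpha[t][j] = log P(w_0..w_t, state_t = j) for the
        // integer-mapped sentence obs.
        static double[][] forwardProbs(int[] obs, double[] pi,
                                       double[][] s2s, double[][] w2s) {
            int n = obs.length, S = pi.length;
            double[][] alpha = new double[n][S];
            for (int j = 0; j < S; j++)
                alpha[0][j] = pi[j] + w2s[j][obs[0]];
            for (int t = 1; t < n; t++) {
                for (int j = 0; j < S; j++) {
                    double acc = Double.NEGATIVE_INFINITY;
                    for (int i = 0; i < S; i++)
                        acc = logSum(acc, alpha[t - 1][i] + s2s[i][j]);
                    alpha[t][j] = acc + w2s[j][obs[t]];
                }
            }
            return alpha;
        }
    }

The backward recursion is symmetric: it accumulates messages from the end of the sentence towards the beginning, again combining terms with logSum.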
2.3.3 HMM with Gibbs Sampling

The HMM model parameters can be integrated out by using the Gibbs sampling algorithm on an unlabeled training corpus D. In this homework you will use the framework defined in the Gibbs Sampling section of [1], which is the same as the model defined in Equation 1 except for the parameter names. Johnson uses x and y to refer to words and states, respectively, while we use w and t. Similarly, αx and αy are the hyper-parameters of the model in [1], whereas αw and αt are the corresponding hyper-parameters of our model. The Gibbs sampler that you are going to implement will sample from the conditional distribution of state t_i, P(t_i | w, t_{-i}, α), where t_{-i} denotes the states of all positions except i and α = {αw, αt}.

2.3.4 Gibbs

In this part of the homework you are going to implement a Gibbs sampler that estimates the desired parameters and probabilities of a first-order HMM. The fields of the class are defined for you; however, feel free to add any extra fields as long as you do not remove any predefined ones. Similar to EMBiHmm, the data field keeps the DataReader object. You can find the definition of these procedures in Appendix A.

The pi, s2s, w2s, states, words and obs fields are common with the EMBiHmm class, and their definitions can be found in Section 2.3.2. The iter field is the number of Gibbs sampling passes over all observations. The sampleCounter field keeps the number of samplings performed; therefore it must be initialized to 0 in initDicUniformTags and must be incremented whenever the Probability.sampleMultinomial method is called in sampleGibbs.

During the whole sampling process the algorithm collects samples at predetermined sampleCounter values and uses the average of these multiple samples to estimate the model parameters. The beginning samples are correlated with the initialization of the word-tag pairs, therefore a predefined number of the beginning samples are ignored; this period is called the burn-in period. The algorithm does not collect any samples until the burn-in period is over, and the number of samplings required in the burn-in period is defined by the burnin field. After sampling burnin samples the algorithm starts to collect samples at every k-th step, and the value of k is stored in the sampleStep field. The sampleTaken field keeps the number of collected samples after the burn-in period. To make things clear, let's assume that the burn-in period is 1000 samples (burnin) and we are going to collect samples every 100 samplings (sampleStep). If the algorithm samples 3000 times, then the number of collected samples will be 20 (sampleTaken). The average of these samples is used to estimate the s2s and w2s matrices. Each estimate of these matrices is added to integrateS2S and integrateW2S, respectively. After collecting the last sample, these integration matrices are normalized in order to average the estimates.

The cs2s, cw2s and cs fields keep the counts of tag bi-grams, word-tag pairs and tag occurrences over all observations. These fields are set to their initial values by the countOccurance method after the observation tags are randomly initialized according to the dictionary by the initDicUniformTags function. The gibbsTags field keeps the randomly initialized tags of each observation. The alphaw and alphat fields keep the values of the hyper-parameters αw and αt.

Constructor
The constructor initializes all the fields and then randomly assigns tags by using initDicUniformTags. Finally, it calls countOccurance to set the initial count fields introduced above.

Initialize the tags
The initDicUniformTags method randomly assigns a tag from the possible tags of a given observation and writes these random tags to gibbsTags. The possible set of tags is determined by the dictionary file.

Counting
The countOccurance method counts the tag bi-grams, word-tag pairs and occurrences of each tag in gibbsTags and writes each count to cs2s, cw2s and cs, accordingly.

Gibbs Sampling
The sampleGibbs method is the core of the class; it samples over all observations iter times. This function updates the count fields of the class according to the sampling process and also collects samples at desired iterations by calling the getSample function. A sketch of how the smoothed counts from Equation 2 can be turned into sampling weights for a single position is given below.
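Below is an editorial sketch of how the two factors in Equation 2 could be combined into unnormalized sampling weights for one position, restricted to the dictionary tags of the current word. It is only an illustration under assumptions: the helper name tagWeights and the exact indexing of cs2s and cw2s are hypothetical, the counts are assumed to already exclude the position being resampled, and a complete sampler also multiplies in the factor for the transition into the following tag (see [1]).

    // Editorial sketch (hypothetical helper): unnormalized weights for
    // resampling the tag at one position from the smoothed counts of
    // Equation 2.  Assumes the current position's own counts have already
    // been subtracted from cs2s, cw2s and cs.
    public class GibbsWeightSketch {
        static double[] tagWeights(int prevTag, int word, int[] possibleTags,
                                   int[][] cs2s, int[][] cw2s, int[] cs,
                                   double alphat, double alphaw, int T, int W) {
            double[] weights = new double[possibleTags.length];
            for (int k = 0; k < possibleTags.length; k++) {
                int t = possibleTags[k];
                double trans = (cs2s[prevTag][t] + alphat) / (cs[prevTag] + T * alphat);
                double emit  = (cw2s[t][word]   + alphaw) / (cs[t]       + W * alphaw);
                weights[k] = trans * emit; // normalize before Probability.sampleMultinomial
            }
            return weights;
        }
    }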
Collecting Samples
The getSample method collects samples at predetermined steps. The steps are determined according to the sampleStep and burnin fields, as described at the beginning of this section. The collected samples are added to the integration fields to be used during decoding.

Estimating Bi-gram Tag Probabilities
The estimateS2S method estimates a probability array by using the cs2s and cs count arrays that are updated by sampleGibbs. It takes a 2D double array and fills this array with the estimated probabilities. Similarly, the estimateW2S method estimates a probability array by using the cw2s and cs count arrays that are updated by sampleGibbs. It takes a 2D double array and fills this array with the estimated probabilities.

2.3.5 Viterbi

This class implements the Viterbi decoding algorithm, which finds the tag sequence of an observation with the highest log-likelihood given the pi, s2s and w2s probability arrays. The Viterbi decoder uses log-probabilities. The pi, s2s, w2s, states, words and obs fields are common with the EMBiHmm and Gibbs classes, and their definitions can be found in Sections 2.3.2 and 2.3.4. The key field keeps the correct tags of the observations that are read from the input data file, while the ans field keeps the tags that are assigned by the methods. The accuracy field keeps the accuracy of the model and is set by the score functions. Similarly, loglikelihood keeps the total likelihood of the model answers and is initially set to Probability.logzero.

Constructor
There are two constructors for this class: the first one initializes the above parameters using EMBiHmm, while the second one uses Gibbs.

Viterbi Decoding
The viterbi method takes an observation array and returns the tag sequence with the highest log-likelihood given the probability matrices. This function also adds the likelihood of the current observation to the loglikelihood field. A sketch of the recursion is given at the end of this section.

Viterbi on all observations
The viterbiAllObs method calls the viterbi function for every observation; it is given to you.

Scoring
There are two scoring functions in this homework. The first one does not take any input and just runs Viterbi with the initialized parameters, while the second one takes a vector of answer tags and only compares this answer with the correct tags that are kept in the key field. Thus, if you call the first score function, the loglikelihood and accuracy fields will be updated; the second one only updates accuracy. Both functions are given.
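The following editorial sketch shows one standard way to implement the log-space Viterbi recursion with backpointers. The free-standing signature and the class name ViterbiSketch are assumptions; your implementation must follow the provided Viterbi interface and use the class fields instead of parameters.

    // Editorial sketch: log-space Viterbi decoding with backpointers.
    // pi[i], s2s[i][j] and w2s[i][w] hold log-probabilities, as in the homework.
    public class ViterbiSketch {
        static int[] decode(int[] obs, double[] pi, double[][] s2s, double[][] w2s) {
            int n = obs.length, S = pi.length;
            double[][] delta = new double[n][S]; // best log score ending in state j at position t
            int[][] back = new int[n][S];        // backpointers
            for (int j = 0; j < S; j++)
                delta[0][j] = pi[j] + w2s[j][obs[0]];
            for (int t = 1; t < n; t++) {
                for (int j = 0; j < S; j++) {
                    double best = Double.NEGATIVE_INFINITY;
                    int arg = 0;
                    for (int i = 0; i < S; i++) {
                        double score = delta[t - 1][i] + s2s[i][j];
                        if (score > best) { best = score; arg = i; }
                    }
                    delta[t][j] = best + w2s[j][obs[t]];
                    back[t][j] = arg;
                }
            }
            int[] tags = new int[n];             // trace back the best path
            for (int j = 1; j < S; j++)
                if (delta[n - 1][j] > delta[n - 1][tags[n - 1]]) tags[n - 1] = j;
            for (int t = n - 1; t > 0; t--)
                tags[t - 1] = back[t][tags[t]];
            return tags;
        }
    }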
2.4 Fun Time

Please perform the following steps for the EM model. For each case below, please report the log-likelihood of the model at every 200 observations, the initial and final model size, and the change in the log-likelihood. Also report the initial accuracy (without training with EM) and the final accuracy of each case by using Viterbi decoding. Please comment on your results. Bonus: Is there anything unexpected when you compare the initial and final accuracies? Hint: The model size can be determined by counting the non-zero entries of the probability matrices.

1. Do uniform initialization with the dictionary, train EM, and report the total log-likelihood.
2. Initialize s2s with gold and w2s with uniform probabilities.
3. Initialize w2s with gold and s2s with uniform probabilities.
4. Do gold initialization, train EM, and report the total log-likelihood.

Please perform the following steps for the Gibbs model.

1. Run Gibbs with these settings: alphat = [0.1, 0.01, 0.001, 0.0001] and alphaw = [1, 0.01, 0.001]. For each model, calculate the model accuracies by: (1) using the final sample to estimate the parameters of Viterbi, (2) using the average of the collected samples to estimate the parameters of Viterbi, and (3) using the tags that are assigned by the final sample (do not run Viterbi in this case). Thus you are going to report 4x3x3 = 36 results.
2. Instead of iterating over all observations 100 times, iterate 1000 times and calculate the accuracies in the same manner as the previous step for the worst and the best parameter settings of the previous step.

Appendix

A util.DataReader

This class represents the data reader object. You can find the definitions of the object fields and methods in DataReader.java.

Constructor
DataReader objects are constructed by calling the constructor of the DataReader class. The constructor takes two variables: (1) the training file name and (2) the vocabulary file name, and then initializes the fields of the class. The names of the training and the vocabulary files are written to the fileName and vocName fields, respectively. In order to make the calculations easier, the constructor first processes each sentence by concatenating start and end tags, then it maps each word and tag to a unique integer. The data field stores the training data sentences mapped to the integer domain, while tags keeps the corresponding tags of each word mapped to the integer domain. The numbers of unique words and tags are stored in wc and tc, respectively. The mappings from word to integer and from tag to integer are kept in the word2int and tag2int hashes. Similarly, int2word and int2tag keep the mappings from the integer domain to the word and tag domains, respectively. The mappings are dumped to the data.map file in the data folder of the homework, so you can see the integer mapping of each word. Finally, the constructor reads the dictionary file into voc. The concatenation of the start and end tags does not increase the problem complexity since they have only one possible tag in the dictionary. voc is a 2D array in which the first dimension keeps the words while the second dimension keeps the possible tags of the corresponding word. For example, the integer mappings of the start and end tags are 0 and 1 and the integer mappings of their POS tags are also 0 and 1, thus voc[0][0] = 0 and voc[1][0] = 1.

Getters and Setters
All the fields have getter and setter methods, even though you are not going to edit the DataReader fields.

Preprocess of the sentences
Given a string, preProcess concatenates the start tag (<S>) and the end tag (</S>). The start and end tags enable the model to determine the start and end of a sentence.

toString
The toStringTags method returns the tags sorted according to their integer mappings. Similarly, toStringHashMap converts the target hash into a string where each line consists of a word and its integer mapping.

B util.Matrix

This class performs various common operations on matrices and arrays. It is defined statically, therefore you can reach its methods by using the class name directly (for example, Matrix.convertLog). The only field is the static variable logzero, which is a very small negative number but not minus infinity.

Number of non-zero entries?
The countNonZero method takes a 2D integer array, counts the number of non-zero entries and returns it.
Converting the matrices
The convertLog and convertExp methods take the natural logarithm and the exponential of a given 2D double matrix, respectively.

Normalize matrices
The normalize method takes a 2D double array and normalizes it row-wise.

Comparing vectors
The compareVectors method compares two integer vectors element by element and returns the number of common elements.

C util.Probability

This class provides the probability-related functions that are used in the other classes. It is a static class, therefore you can reach its methods by using the class name directly. There are four static fields defined in this class: (1) logzero is a very small negative number used instead of minus infinity, (2) epsilon is the maximum amount of numerical error, therefore please use epsilon while checking whether probabilities add up to 1 or not, (3) r is a Random object that generates the random numbers, and (4) seed is the initial seed value of r.

Summation of logarithms: Numerical problems
In order to prevent underflow and overflow, the summation of log-probabilities requires special attention. Therefore you can use the logSum method, which takes two log-probabilities and returns the log of their summation. Bonus: Explain why this function works.

Is everything OK?
Since we are dealing with probabilities, it is necessary to check whether the probabilities add up to 1 or not. This can be checked by the probabilityCheck or logProbabilityCheck methods, which take a probability or a log-probability array, respectively, and check whether its entries sum to 1.

Sample from Multinomial
The sampleMultinomial method takes a probability vector and samples a point from this vector according to the probabilities.

References

[1] M. Johnson. Why doesn't EM find good HMM POS-taggers? In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296–305, 2007.

[2] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.