NLP-AI Seminar
Graphical Models for Segmenting and Labeling Sequence Data
Manoj Kumar Chinnakotla

Outline
• Introduction
• Directed Graphical Models
  – Hidden Markov Models (HMMs)
  – Maximum Entropy Markov Models (MEMMs)
• Label Bias Problem
• Undirected Graphical Models
  – Conditional Random Fields (CRFs)
• Summary

The Task
• Labeling
  – Given sequence data, mark each data item with the appropriate tag
• Segmentation
  – Given sequence data, segment it into non-overlapping groups such that related entities fall in the same group

Applications
• Computational Linguistics
  – POS Tagging
  – Information Extraction
  – Syntactic Disambiguation
• Computational Biology
  – DNA and Protein Sequence Alignment
  – Sequence homologue searching
  – Protein Secondary Structure Prediction

Example: POS Tagging

Directed Graphical Models
• Hidden Markov Models (HMMs)
  – Assign a joint probability to paired observation and label sequences
  – The parameters are trained to maximize the joint likelihood of the training examples

Hidden Markov Models (HMMs)
• Generative model – models the joint distribution P(w, t)
• Generation process
  – Probabilistic finite state machine
  – Set of states – correspond to tags
  – Alphabet – set of words
  – Transition probability – P(t_i | t_{i-1})
  – State (emission) probability – P(w_i | t_i)

HMMs (Contd..)
• For a given word/tag sequence pair:
  P(w, t) = \prod_i P(t_i | t_{i-1}) \cdot P(w_i | t_i)
• Why Hidden?
  – The sequence of tags which generated the word sequence is not visible
• Why Markov?
  – Based on the Markovian assumption: the current tag depends only on the previous 'n' tags
  – Mitigates the "sparsity problem"
• Training
  – Learning the transition and emission probabilities from data

HMMs Tagging Process
• Given a string of words w, choose the tag sequence t* such that
  t^* = \arg\max_t P(w, t)
• Computationally expensive – need to evaluate all possible tag sequences!
  – For 'n' possible tags and 'm' positions: O(n^m)
• Viterbi Algorithm
  – Used to find the optimal tag sequence t*
  – Efficient dynamic-programming-based algorithm (a minimal code sketch follows the HMM slides below)

Disadvantages of HMMs
• Need to enumerate all possible observation sequences
• Not possible to represent multiple interacting features
• Difficult to model long-range dependencies of the observations
• Very strict independence assumptions on the observations
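To make the Viterbi step above concrete, here is a minimal sketch of Viterbi decoding for an HMM tagger, assuming the transition probabilities P(t_i | t_{i-1}), emission probabilities P(w_i | t_i), and start probabilities have already been estimated. The function name `viterbi`, the dictionary-based probability tables, and the toy numbers in the usage lines are illustrative assumptions, not from the slides.

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Return the tag sequence t* = argmax_t P(w, t) for an HMM.

    trans[(prev, cur)] = P(cur | prev), emit[(tag, word)] = P(word | tag),
    start[tag] = P(tag at position 0). Log probabilities avoid underflow.
    """
    # best[i][t] = log prob of the best tag sequence ending in tag t at position i
    best = [{t: math.log(start.get(t, 1e-12)) + math.log(emit.get((t, words[0]), 1e-12))
             for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p]
                  + math.log(trans.get((p, t), 1e-12))
                  + math.log(emit.get((t, w), 1e-12))) for p in tags),
                key=lambda x: x[1])
            best[i][t] = score
            back[i][t] = prev
    # Follow back-pointers from the best final tag to recover t*.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy usage with made-up probabilities:
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {("N", "N"): 0.4, ("N", "V"): 0.6, ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "dogs"): 0.5, ("N", "bark"): 0.1, ("V", "bark"): 0.4}
print(viterbi(["dogs", "bark"], tags, trans, emit, start))  # ['N', 'V']
```

The dynamic program runs in O(m * n^2) time rather than the O(n^m) brute force mentioned above, and the log-space computation avoids numerical underflow on longer sentences.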
Maximum Entropy Markov Models (MEMMs)
• Conditional exponential models
  – Assume the observation sequence is given (it need not be modeled)
  – Train the model to maximize the conditional likelihood P(Y|X)

MEMMs (Contd..)
• For a new data sequence x, the label sequence y which maximizes P(y | x, Θ) is assigned (Θ is the parameter set)
• Arbitrary, non-independent features on the observation sequence are possible
• Conditional models are known to perform better than generative models
• Performs per-state normalization
  – The total mass which arrives at a state must be distributed among all possible successor states

Label Bias Problem
• Bias towards states with fewer outgoing transitions
• Due to per-state normalization
• An example MEMM

Undirected Graphical Models
• Random Fields

Conditional Random Fields (CRFs)
• Conditional exponential model, like an MEMM
• Has all the advantages of MEMMs without the label bias problem
  – An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
  – A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
• Allows some transitions to "vote" more strongly than others, depending on the corresponding observations

Definition of CRFs

CRF Distribution Function
  p_\theta(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \Big)
Where:
• V = set of label random variables (vertices); E = set of edges between them
• f_k and g_k are features: f_k = edge feature, g_k = state feature
• \theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n); \lambda_k and \mu_k are the parameters to be estimated
• y|_e = the components of y defined by edge e
• y|_v = the components of y defined by vertex v
(A minimal code sketch illustrating this distribution appears at the end.)

CRF Training

CRF Training (Contd..)
• Condition for maximum likelihood: the expected feature count computed using the model equals the empirical feature count from the training data
• A closed-form solution for the parameters is not possible
• Iterative algorithms are employed – they improve the log likelihood in successive iterations
• Examples
  – Generalized Iterative Scaling (GIS)
  – Improved Iterative Scaling (IIS)

Graphical Comparison: HMMs, MEMMs, CRFs

POS Tagging Results

Summary
• HMMs
  – Directed, generative graphical models
  – Cannot be used to model overlapping features on observations
• MEMMs
  – Directed, conditional models
  – Can model overlapping features on observations
  – Suffer from the label bias problem due to per-state normalization
• CRFs
  – Undirected, conditional models
  – Avoid the label bias problem
  – Efficient training possible

Thanks!

Acknowledgements
Some slides in this presentation are from Rongkun Shen's (Oregon State Univ) presentation on CRFs.
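As a closing illustration of the CRF distribution defined above, here is a minimal brute-force sketch for a linear-chain CRF. The function name `crf_prob`, the toy feature functions, weights, and labels are illustrative assumptions, not from the slides; it enumerates all label sequences to compute Z(x), which is only feasible for tiny examples, whereas a practical implementation would compute Z(x) with forward-backward dynamic programming.

```python
import itertools
import math

def crf_prob(y, x, labels, edge_feats, node_feats, lam, mu):
    """Return p(y | x) = exp(score(y)) / Z(x) for a linear-chain CRF.

    score(y) = sum over edges,k of lam[k]*f_k + sum over vertices,k of mu[k]*g_k.
    Z(x) is obtained by brute-force enumeration of all label sequences.
    """
    def score(seq):
        s = 0.0
        # Edge (transition) features f_k over consecutive label pairs.
        for i in range(1, len(seq)):
            s += sum(l * f(seq[i - 1], seq[i], x, i) for l, f in zip(lam, edge_feats))
        # Vertex (state) features g_k over single labels.
        for i in range(len(seq)):
            s += sum(m * g(seq[i], x, i) for m, g in zip(mu, node_feats))
        return s

    z = sum(math.exp(score(seq)) for seq in itertools.product(labels, repeat=len(x)))
    return math.exp(score(y)) / z

# Toy usage: one edge feature rewarding an N->V transition,
# one state feature tying the word "bark" to the label V.
edge_feats = [lambda yp, yc, x, i: 1.0 if (yp, yc) == ("N", "V") else 0.0]
node_feats = [lambda yc, x, i: 1.0 if (x[i] == "bark" and yc == "V") else 0.0]
x = ["dogs", "bark"]
print(crf_prob(("N", "V"), x, ["N", "V"], edge_feats, node_feats, lam=[1.5], mu=[2.0]))
```

Note that the single partition function Z(x) normalizes over whole label sequences; this global normalization, as opposed to the MEMM's per-state normalization, is what lets CRFs avoid the label bias problem discussed earlier.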