Discriminative Learning for Structured Output Prediction
CS4780/5780 – Machine Learning, Fall 2014
Thorsten Joachims, Cornell University

Reading: T. Joachims, T. Hofmann, Yisong Yue, Chun-Nam Yu, "Predicting Structured Objects with Support Vector Machines", Communications of the ACM, Research Highlight, 52(11):97–104, 2009.
http://www.cs.cornell.edu/people/tj/publications/joachims_etal_09b.pdf

Discriminative vs. Generative
• Bayes decision rule:
  h_bayes(x) = argmax_{y ∈ Y} P(Y=y | X=x) = argmax_{y ∈ Y} P(X=x | Y=y) P(Y=y)
• Generative:
  – Idea: make assumptions about P(X=x | Y=y) and P(Y=y).
  – Method: estimate the parameters of the two distributions, then apply the Bayes decision rule.
• Discriminative:
  – Idea: define a set of prediction rules (i.e. hypotheses) H, then search for the h ∈ H that best approximates h_bayes.
  – Method: find the h ∈ H that minimizes training error.

Idea for Discriminative Training of HMMs
• Bayes rule: h_bayes(x) = argmax_{y ∈ Y} P(Y=y | X=x) = argmax_{y ∈ Y} P(X=x | Y=y) P(Y=y)
• Model P(Y=y | X=x) with a linear score w · Φ(x, y) so that
  argmax_{y ∈ Y} P(Y=y | X=x) = argmax_{y ∈ Y} w · Φ(x, y)
• Question: can we train HMMs discriminatively?

Joint Feature Map for Sequences
• Viterbi view:
  – Each transition and emission has a weight.
  – The score of a sequence is the sum of its weights.
  – Find the highest-scoring sequence: h(x) = argmax_{y ∈ Y} w · Φ(x, y)
• Example: x = "The dog chased the cat", y = Det N V Det N
  Φ(x, y) = ( Det→N: 2, Det→V: 0, N→V: 1, V→Det: 1, …, Det→dog: 0, Det→the: 2, N→dog: 1, V→chased: 1, N→cat: 1, … )
• Features:
  – One feature for each possible start state.
  – One feature for each possible transition.
  – One feature for each possible output in each possible state.
  – Feature values are counts.
  (A Python sketch of this feature map and of Viterbi decoding follows the Soft-Margin Structural SVM slide below.)

Training HMMs with Structural SVMs
• Linear chain HMM:
  P(x, y) = P(y_1) P(x_1 | y_1) ∏_{i=2..l} P(x_i | y_i) P(y_i | y_{i-1})
  log P(x, y) = log P(y_1) + log P(x_1 | y_1) + Σ_{i=2..l} [ log P(x_i | y_i) + log P(y_i | y_{i-1}) ]
• Define Φ(x, y) so that the model is isomorphic to the HMM:
  – Φ(x, y) is a feature vector that describes the match between x and y.
  – Tune w so that the correct y has the highest value of w · Φ(x, y).
• Hypothesis space: h(x) = argmax_{y ∈ Y} w · Φ(x, y)
• Intuition: a discriminatively trained linear-chain HMM.

Structural Support Vector Machine
• Joint features Φ(x, y) describe the match between x and y.
• Learn weights w so that w · Φ(x, y) is maximal for the correct y.

Structural SVM Training Problem
• Training set: (x_1, y_1), …, (x_n, y_n)
• Prediction rule: h_svm(x) = argmax_{y ∈ Y} w · Φ(x, y)
• Hard-margin optimization problem:
  min_w  ½ w · w
  s.t.  ∀i, ∀y ∈ Y \ {y_i}:  w · Φ(x_i, y_i) ≥ w · Φ(x_i, y) + 1
  – The correct label y_i must have a higher value of w · Φ(x_i, y) than any incorrect label y.
  – Find the weight vector with the smallest norm.

Soft-Margin Structural SVM
• Loss function Δ(y_i, y) measures the mismatch between target and prediction.
• Soft-margin optimization problem:
  min_{w, ξ ≥ 0}  ½ w · w + (C/n) Σ_i ξ_i
  s.t.  ∀i, ∀y ∈ Y:  w · Φ(x_i, y_i) − w · Φ(x_i, y) ≥ Δ(y_i, y) − ξ_i
• Lemma: the training loss (1/n) Σ_i Δ(y_i, h_svm(x_i)) is upper bounded by (1/n) Σ_i ξ_i at the optimum.
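The joint feature map and the argmax prediction rule above can be made concrete with a short sketch. The Python code below is illustrative and not from the lecture: the dictionary-keyed feature representation, the tag set, and the helper names (joint_feature_map, score, viterbi) are assumptions chosen for readability. It counts the start, transition, and emission features for a given pair (x, y), scores a pair as w · Φ(x, y), and uses the Viterbi recurrence to compute h(x) = argmax_y w · Φ(x, y) for a linear-chain model.

```python
from collections import Counter

def joint_feature_map(x, y):
    """Phi(x, y): counts of start, transition, and emission events for tags y on words x."""
    phi = Counter()
    phi[("start", y[0])] += 1                      # one feature per possible start state
    for i in range(1, len(y)):
        phi[("trans", y[i - 1], y[i])] += 1        # one feature per possible transition
    for word, tag in zip(x, y):
        phi[("emit", tag, word)] += 1              # one feature per output in each state
    return phi

def score(w, phi):
    """w . Phi(x, y) as a sparse dot product over the feature dictionary."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def viterbi(x, tags, w):
    """h(x) = argmax_y w . Phi(x, y) for a linear-chain model (Viterbi decoding)."""
    # delta[t]: best score of any tag sequence for the prefix seen so far that ends in tag t
    delta = {t: w.get(("start", t), 0.0) + w.get(("emit", t, x[0]), 0.0) for t in tags}
    backpointers = []
    for word in x[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: delta[s] + w.get(("trans", s, t), 0.0))
            new_delta[t] = (delta[prev] + w.get(("trans", prev, t), 0.0)
                            + w.get(("emit", t, word), 0.0))
            pointers[t] = prev
        delta = new_delta
        backpointers.append(pointers)
    # follow back-pointers from the best final tag
    best_last = max(tags, key=lambda t: delta[t])
    y = [best_last]
    for pointers in reversed(backpointers):
        y.append(pointers[y[-1]])
    return list(reversed(y))

# Example from the slides: x = "The dog chased the cat", y = Det N V Det N
x = ["The", "dog", "chased", "the", "cat"]
y = ["Det", "N", "V", "Det", "N"]
phi = joint_feature_map(x, y)            # e.g. phi[("trans", "Det", "N")] == 2
w = dict(phi)                            # toy weights: just the counts from this one example
print(score(w, phi))                     # w . Phi(x, y)
print(viterbi(x, ["Det", "N", "V"], w))  # -> ['Det', 'N', 'V', 'Det', 'N']
```

With these toy weights (the counts from the single annotated sentence), Viterbi recovers the tag sequence Det N V Det N for that sentence; real training instead learns w from many examples, as in the structural SVM formulations above.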
Experiment: Part-of-Speech Tagging
• Task: given a sequence of words x, predict the sequence of tags y.
  x = "The dog chased the cat"  →  y = Det N V Det N
  – Dependencies come from tag–tag transitions in the Markov model.
• Model:
  – Markov model with one state per tag and words as emissions.
  – Each word is described by a ~250,000-dimensional feature vector (all word suffixes/prefixes, word length, capitalization, …).
• Experiment (by Dan Fleisher):
  – Train/test on 7966/1700 sentences from the Penn Treebank.
  – [Bar chart of test accuracy (%) for Brill (RBT), HMM (ACOPOST), kNN (MBT), Tree Tagger, SVM Multiclass (SVM-light), and SVM-HMM (SVM-struct); values shown: 94.68, 95.02, 95.63, 95.75, 95.78, 96.49.]

Experiment: Named Entity Recognition
• Task (NE identification): identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages.
• Data:
  – Spanish newswire articles, 300 training sentences.
  – 9 tags: no-name, plus beginning and continuation of person name, organization, location, and misc name.
  – Output words are described by features (e.g. starts with a capital letter, contains a number, etc.).
• Error on test set (% mislabeled tags):
  – Generative HMM: 9.36%
  – Support Vector Machine HMM: 5.08%

Cutting-Plane Algorithm for Structural SVM
• Input: training set (x_1, y_1), …, (x_n, y_n), regularization parameter C, precision ε.
• Start with an empty working set of constraints.
• REPEAT
  – FOR i = 1, …, n
    • Find the most violated constraint: compute ŷ = argmax_{y ∈ Y} [ Δ(y_i, y) + w · Φ(x_i, y) ].
    • IF this constraint is violated by more than ε:
      – Add the constraint to the working set.
      – Re-optimize the structural SVM over the working set.
    • ENDIF
  – ENDFOR
• UNTIL the working set has not changed during the iteration.
⇒ Polynomial-time algorithm (SVM-struct).
(A minimal sketch of the "find most violated constraint" step follows this slide.)
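Here is a minimal sketch of the "find most violated constraint" step for sequence labeling, assuming the Hamming loss and the same dictionary-keyed features as in the earlier sketch. The function names (seq_score, loss_augmented_viterbi, constraint_violation) and the choice of loss are assumptions for illustration; the re-optimization of the quadratic program over the working set (the job of SVM-struct) is not shown. Because the Hamming loss decomposes over positions, Δ(y_i, y) can simply be folded into the emission scores of the Viterbi recurrence.

```python
def seq_score(x, y, w):
    """w . Phi(x, y) computed directly from the tag sequence."""
    s = w.get(("start", y[0]), 0.0) + w.get(("emit", y[0], x[0]), 0.0)
    for i in range(1, len(x)):
        s += w.get(("trans", y[i - 1], y[i]), 0.0) + w.get(("emit", y[i], x[i]), 0.0)
    return s

def loss_augmented_viterbi(x, y_true, tags, w):
    """Most violated constraint: y_hat = argmax_y [ Delta(y_true, y) + w . Phi(x, y) ]
    for the Hamming loss, which adds 1 at every position where y disagrees with y_true."""
    def augmented_emit(t, i):
        return w.get(("emit", t, x[i]), 0.0) + (0.0 if t == y_true[i] else 1.0)

    delta = {t: w.get(("start", t), 0.0) + augmented_emit(t, 0) for t in tags}
    backpointers = []
    for i in range(1, len(x)):
        new_delta, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: delta[s] + w.get(("trans", s, t), 0.0))
            new_delta[t] = delta[prev] + w.get(("trans", prev, t), 0.0) + augmented_emit(t, i)
            pointers[t] = prev
        delta = new_delta
        backpointers.append(pointers)
    best_last = max(tags, key=lambda t: delta[t])
    y_hat = [best_last]
    for pointers in reversed(backpointers):
        y_hat.append(pointers[y_hat[-1]])
    return list(reversed(y_hat))

def constraint_violation(x, y_true, y_hat, w):
    """Delta(y_true, y_hat) - [ w.Phi(x, y_true) - w.Phi(x, y_hat) ]."""
    hamming = sum(a != b for a, b in zip(y_true, y_hat))
    margin = seq_score(x, y_true, w) - seq_score(x, y_hat, w)
    return hamming - margin

# Inner loop of the cutting-plane algorithm (sketch):
#   y_hat = loss_augmented_viterbi(x_i, y_i, tags, w)
#   if constraint_violation(x_i, y_i, y_hat, w) > xi_i + epsilon:
#       add the constraint for (x_i, y_hat) to the working set and re-solve the QP
```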
Supervised Learning from Examples
• Find a function h: X → Y from input space X to output space Y such that the prediction error is low.
• Typical case: the output space is just a single number.
  – Classification: y ∈ {-1, +1}
  – Regression: y is some real number.
• General: predict outputs y that are complex objects.

General Problem: Predict Complex Outputs

Examples of Complex Output Spaces
• Natural language parsing:
  – Given a sequence of words x, predict the parse tree y.
  – Dependencies come from structural constraints, since y has to be a tree.
  – Example: x = "The dog chased the cat", y = [S [NP [Det The] [N dog]] [VP [V chased] [NP [Det the] [N cat]]]]
• Multi-label classification:
  – Given a (bag-of-words) document x, predict a set of labels y.
  – Dependencies between labels come from correlations between labels ("iraq" and "oil" in a newswire corpus).
  – Example: x = "Due to the continued violence in Baghdad, the oil price is expected to further increase. OPEC officials met with …"
    y = (antarctica: -1, benelux: -1, germany: -1, iraq: +1, oil: +1, coal: -1, trade: -1, acquisitions: -1)
• Noun-phrase co-reference:
  – Given a set of noun phrases x, predict a clustering y.
  – Structural dependencies, since the prediction has to be an equivalence relation.
  – Correlation dependencies from interactions.
  – Example: "The policeman fed the cat. He did not know that he was late. The cat is called Peter."
• Scene recognition:
  – Given a 3D point cloud with RGB from a Kinect camera, segment it into volumes.
  – Geometric dependencies between segments (e.g. a monitor is usually close to a keyboard).

Wrap-Up

Batch Learning for Classification
• Discriminative: Decision Trees, Perceptron, Linear SVMs, Kernel SVMs
• Generative: Multinomial Naïve Bayes, Multivariate Naïve Bayes, Less Naïve Bayes, Linear Discriminant, Nearest Neighbor
• Other methods: Logical Rule Learning, Stochastic Gradient, Logistic Regression, Neural Networks, RBF Networks, Boosting, Bagging, Parametric (Graphical) Models, Non-Parametric Models, *-Regression, *-Multiclass
⇒ Methods + Theory + Practice

Online Learning
• Expert setting: Halving Algorithm, Weighted Majority, Exponentiated Gradient
• Bandit setting: EXP3
• Other methods: Hedge, Follow the Leader, UCB, Zooming, Partial Monitoring
• Other settings: Dueling Bandits, Coactive Learning
⇒ CS6783 Machine Learning Theory

Structured Prediction
• Discriminative: Structural SVMs
• Generative: Hidden Markov Model
• Other methods: Maximum Margin Markov Networks, Conditional Random Fields, Markov Random Fields, Bayesian Networks, Statistical Relational Learning
⇒ CS4786 Machine Learning for Data Science
⇒ CS4782 Probabilistic Graphical Models

Unsupervised Learning
• Clustering: Hierarchical Agglomerative Clustering, K-Means, Mixture of Gaussians and EM-Algorithm
• Other methods: Spectral Clustering, Latent Dirichlet Allocation, Latent Semantic Analysis, Multi-Dimensional Scaling
• Other tasks: Outlier Detection, Novelty Detection, Dimensionality Reduction, Non-Linear Manifold Detection
⇒ CS4786 Machine Learning for Data Science
⇒ CS4850 Math Found for the Information Age

Other Learning Problems and Applications
• Recommender systems, search ranking, etc. → CS4300 Information Retrieval
• Reinforcement learning and Markov decision processes → CS4758 Robot Learning
• Computer vision → CS4670 Intro Computer Vision
• Natural language processing → CS4740 Intro Natural Language Processing

Other Machine Learning Courses at Cornell
• INFO 3300 - Data-Driven Web Pages
• CS 4700 - Introduction to Artificial Intelligence
• CS 4780/5780 - Machine Learning (for Intelligent Systems)
• CS 4786/5786 - Machine Learning for Data Science
• CS 4758 - Robot Learning
• CS 4782 - Probabilistic Graphical Models
• OR 4740 - Statistical Data Mining
• CS 6780 - Advanced Machine Learning
• CS 6783 - Machine Learning Theory
• CS 6784 - Advanced Topics in Machine Learning
• CS 6756 - Advanced Topics in Robot Learning
• ORIE 6740 - Statistical Learning Theory for Data Mining
• ORIE 6750 - Optimal Learning
• ORIE 6780 - Bayesian Statistics and Data Analysis
• MATH 7740 - Statistical Learning Theory