Discriminative Learning for Structured Output Prediction
CS4780/5780 – Machine Learning
Fall 2014
Thorsten Joachims
Cornell University

Reading:
T. Joachims, T. Hofmann, Yisong Yue, Chun-Nam Yu, "Predicting Structured Objects with Support Vector Machines", Communications of the ACM, Research Highlight, 52(11):97-104, 2009.
http://www.cs.cornell.edu/people/tj/publications/joachims_etal_09b.pdf

Discriminative vs. Generative
Bayes Decision Rule:
– h_bayes(x) = argmax_{y∈Y} P(Y=y | X=x)
             = argmax_{y∈Y} P(X=x | Y=y) · P(Y=y)
Generative:
– Idea: Make assumptions about P(X=x | Y=y) and P(Y=y).
– Method: Estimate the parameters of the two distributions, then apply the Bayes decision rule.
Discriminative:
– Idea: Define a set of prediction rules (i.e. hypotheses) H, then search for the h ∈ H that best approximates
  h_bayes(x) = argmax_{y∈Y} P(Y=y | X=x)
– Method: Find the h ∈ H that minimizes training error.
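
The second equality in the Bayes decision rule is worth spelling out: applying Bayes' theorem and dropping the denominator, which does not depend on y, gives (a standard one-line derivation, added here for completeness):

```latex
h_{bayes}(x) = \operatorname*{argmax}_{y \in Y} P(Y{=}y \mid X{=}x)
             = \operatorname*{argmax}_{y \in Y} \frac{P(X{=}x \mid Y{=}y)\,P(Y{=}y)}{P(X{=}x)}
             = \operatorname*{argmax}_{y \in Y} P(X{=}x \mid Y{=}y)\,P(Y{=}y)
```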
Idea for Discriminative Training of HMM
Bayes Rule:
– h_bayes(x) = argmax_{y∈Y} P(Y=y | X=x)
             = argmax_{y∈Y} P(X=x | Y=y) · P(Y=y)
Idea:
– Model P(Y=y | X=x) with w · φ(x,y), w ∈ ℜ^N, so that
  argmax_{y∈Y} P(Y=y | X=x) = argmax_{y∈Y} w · φ(x,y)
– Tune w so that the correct y has the highest value of w · φ(x,y)
– φ(x,y) is a feature vector that describes the match between x and y
Question: Can we train HMMs discriminatively?
Joint Feature Map for Sequences
Intuition (Linear Chain HMM, Viterbi):
– Each transition and emission has a weight
– The score of a sequence is the sum of its weights
– Find the highest scoring sequence: h(x) = argmax_{y∈Y} w · φ(x,y)
Example:
x: The dog chased the cat
y: Det N V Det N
φ(x,y) = ( 2   Det → N
           0   Det → V
           1   N → V
           1   V → Det
           ⋮
           0   Det → dog
           2   Det → the
           1   N → dog
           1   V → chased
           1   N → cat )
– One feature for each possible start state
– One feature for each possible transition
– One feature for each possible output in each possible state
– Feature values are counts
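
A minimal sketch that reproduces the counts in the vector above (the function name and the dict-of-tuples feature encoding are mine; the example sentence and tags are the slide's):

```python
from collections import Counter

def joint_feature_map(words, tags):
    """Count-based joint feature map phi(x, y) for a linear chain:
    start-state, transition, and emission features, as listed above."""
    phi = Counter()
    phi[("start", tags[0])] += 1                # one feature per possible start state
    for prev, curr in zip(tags, tags[1:]):      # one feature per possible transition
        phi[("trans", prev, curr)] += 1
    for tag, word in zip(tags, words):          # one feature per output in each state
        phi[("emit", tag, word.lower())] += 1   # lowercase so "The" and "the" share a feature
    return phi

x = "The dog chased the cat".split()
y = ["Det", "N", "V", "Det", "N"]
phi = joint_feature_map(x, y)
assert phi[("trans", "Det", "N")] == 2          # matches the 2 for Det → N above
assert phi[("emit", "Det", "the")] == 2         # matches the 2 for Det → the above
```

Given a weight vector w stored as a dict with the same keys, the sequence score is then simply sum(w.get(f, 0.0) * v for f, v in phi.items()).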
Training HMMs with Structural SVM
• HMM:
  P(x,y) = P(y1) · P(x1|y1) · ∏_{i=2}^{l} P(xi|yi) · P(yi|y_{i-1})
  log P(x,y) = log P(y1) + log P(x1|y1) + Σ_{i=2}^{l} [ log P(xi|yi) + log P(yi|y_{i-1}) ]
• Define φ(x,y) so that the model is isomorphic to an HMM
• Hypothesis Space: h(x) = argmax_{y∈Y} w · φ(x,y) with w ∈ ℜ^N
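
The isomorphism can be made explicit (a standard observation, spelled out here rather than recovered from the slide): with the count features above, setting each weight to the corresponding log-probability makes the linear score equal the HMM log-likelihood,

```latex
w \cdot \phi(x,y) = \log P(y_1) + \log P(x_1 \mid y_1)
  + \sum_{i=2}^{l} \left[ \log P(x_i \mid y_i) + \log P(y_i \mid y_{i-1}) \right]
  = \log P(x,y)
```

so every HMM is representable in this hypothesis space; discriminative training simply stops requiring that the weights be log-probabilities.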
Structural Support Vector Machine
• Joint features φ(x,y) describe the match between x and y
• Learn weights w so that w · φ(x,y) is maximal for the correct y
• …
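Computing the argmax over the exponentially many tag sequences is what the Viterbi bullets earlier refer to; a runnable dynamic-programming sketch for the feature map above (the function name and the dict-based weight vector are mine):

```python
def viterbi_decode(words, tagset, w):
    """h(x) = argmax_y w . phi(x, y) for the linear-chain feature map,
    via dynamic programming over start/transition/emission weights."""
    # delta[t]: best score of any tag prefix ending in tag t at the current position
    delta = {t: w.get(("start", t), 0.0) + w.get(("emit", t, words[0].lower()), 0.0)
             for t in tagset}
    backptr = []
    for word in words[1:]:
        prev, delta, back = delta, {}, {}
        for t in tagset:
            p = max(tagset, key=lambda q: prev[q] + w.get(("trans", q, t), 0.0))
            delta[t] = (prev[p] + w.get(("trans", p, t), 0.0)
                        + w.get(("emit", t, word.lower()), 0.0))
            back[t] = p
        backptr.append(back)
    tag = max(delta, key=delta.get)     # best final tag ...
    tags = [tag]
    for back in reversed(backptr):      # ... then follow the back-pointers
        tag = back[tag]
        tags.append(tag)
    return list(reversed(tags))
```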
Structural SVM Training Problem
• Training Set: (x1,y1), …, (xn,yn)
• Prediction Rule: h_svm(x) = argmax_{y∈Y} w · φ(x,y)
• Hard-margin optimization problem:
  …
  – The correct label yi must have a higher value of w · φ(x,y) than any incorrect label y
  – Find the weight vector with the smallest norm

Soft-Margin Structural SVM
• Loss function Δ(yi, y) measures the match between target and prediction.
• Soft-margin optimization problem:
  …
• Lemma: The training loss is upper bounded by …
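
The optimization problems themselves were shown as images and did not survive extraction; for reference, the standard margin-rescaling formulations from the cited CACM paper read (reconstructed from that paper, not from the slides):

```latex
% hard margin: smallest-norm w that ranks each correct label first
\min_{w}\ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad
\forall i,\ \forall y \in Y \setminus \{y_i\}:\
w \cdot \phi(x_i, y_i) \;\ge\; w \cdot \phi(x_i, y) + 1

% soft margin: slack variables xi_i, margin rescaled by the loss
\min_{w,\ \xi \ge 0}\ \tfrac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
\quad \text{s.t.}\quad
\forall i,\ \forall y \in Y:\
w \cdot \phi(x_i, y_i) \;\ge\; w \cdot \phi(x_i, y) + \Delta(y_i, y) - \xi_i
```

Under this formulation the lemma states that the training loss is upper bounded by the average slack: (1/n) Σi Δ(yi, h_svm(xi)) ≤ (1/n) Σi ξi.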
Experiment: Part-of-Speech Tagging
• Task
  – Given a sequence of words x, predict the sequence of tags y.
    x: The dog chased the cat
    y: Det N V Det N
  – Dependencies come from the tag-tag transitions in the Markov model.
• Model
  – Markov model with one state per tag and words as emissions
  – Each word is described by a ~250,000-dimensional feature vector (all word suffixes/prefixes, word length, capitalization, …)
• Experiment (by Dan Fleisher)
  – Train/test on 7966/1700 sentences from the Penn Treebank
  – [Bar chart of Test Accuracy (%) comparing Brill (RBT), HMM (ACOPOST), kNN (MBT), Tree Tagger, SVM Multiclass (SVM-light), and SVM-HMM (SVM-struct); the plotted values are 94.68, 95.02, 95.63, 95.75, 95.78, and 96.49]
Experiment: Named Entity Recognition
• Task (NE identification)
  – Identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages.
• Data
  – Spanish newswire articles
  – 300 training sentences
  – 9 tags
    • no-name
    • beginning and continuation of person name, organization, location, misc name
  – Output words are described by features (e.g. starts with a capital letter, contains a number, etc.)
• Error on test set (% mislabeled tags):
  – Generative HMM: 9.36%
  – Support Vector Machine HMM: 5.08%
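
For tagging tasks like the two above, a natural instantiation of the loss Δ(yi, y) from the soft-margin formulation is the Hamming loss, i.e. exactly the "% mislabeled tags" metric reported for NER (a sketch; the function name is mine):

```python
def hamming_loss(y_true, y_pred):
    """Delta(y_i, y): the number of mislabeled tags in the predicted sequence."""
    assert len(y_true) == len(y_pred)
    return sum(t != p for t, p in zip(y_true, y_pred))

# hamming_loss(["Det", "N", "V", "Det", "N"], ["Det", "N", "N", "Det", "N"]) == 1
```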
Cutting-Plane Algorithm for Structural SVM
• Input: training examples (x1,y1), …, (xn,yn) and precision ε
• Start with an empty working set of constraints
• REPEAT
  – FOR each training example (xi,yi)
    • Compute the most violated constraint
    • IF it is violated by more than ε
      – Add the constraint to the working set
      – Optimize the Structural SVM over the working set
    • ENDIF
  – ENDFOR
• UNTIL the working set has not changed during the iteration
→ Polynomial Time Algorithm (SVM-struct)

Supervised Learning from Examples
• Find a function from input space X to output space Y
    h: X → Y
  such that the prediction error is low.
• Typical:
  – Output space is just a single number
    • Classification: {-1, +1}
    • Regression: some real number
• General:
  – Predict outputs that are complex objects
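
A runnable sketch of the loop (the helpers `most_violated`, which for sequences would be a loss-augmented Viterbi search, and `optimize`, a QP solver over the current working set, are assumptions of mine, not the actual SVM-struct API):

```python
def cutting_plane(examples, most_violated, optimize, eps):
    """Cutting-plane training for a structural SVM: repeatedly add each
    example's most violated constraint, re-optimize, stop once stable."""
    working_set = []                 # constraints collected so far: (i, y_hat) pairs
    w, xi = optimize(working_set)    # w: weights; xi[i]: slack of example i
    changed = True
    while changed:                                       # UNTIL nothing has changed
        changed = False
        for i, (x, y) in enumerate(examples):            # FOR each training example
            y_hat, violation = most_violated(w, x, y)    # most violated constraint
            if violation > xi[i] + eps:                  # violated by more than eps?
                working_set.append((i, y_hat))           # add to working set
                w, xi = optimize(working_set)            # re-optimize over working set
                changed = True
    return w
```

The polynomial-time claim rests on a bound showing that only polynomially many constraints are ever added, independent of the exponential size of Y (see the cited CACM paper).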
General Problem: Predict Complex Outputs

Examples of Complex Output Spaces
• Natural Language Parsing
  – Given a sequence of words x, predict the parse tree y.
  – Dependencies come from structural constraints, since y has to be a tree.
    x: The dog chased the cat
    y: [S [NP [Det The] [N dog]] [VP [V chased] [NP [Det the] [N cat]]]]

Examples of Complex Output Spaces
• Multi-Label Classification
  – Given a (bag-of-words) document x, predict a set of labels y.
  – Dependencies between labels come from correlations between labels ("iraq" and "oil" in a newswire corpus); a joint feature map sketch for this case follows these examples.
    x: "Due to the continued violence in Baghdad, the oil price is expected to further increase. OPEC officials met with …"
    y: antarctica -1, benelux -1, germany -1, iraq +1, oil +1, coal -1, trade -1, acquisitions -1

Examples of Complex Output Spaces
• Noun-Phrase Co-reference
  – Given a set of noun phrases x, predict a clustering y.
  – Structural dependencies, since the prediction has to be an equivalence relation.
  – Correlation dependencies come from interactions.
    x: "The policeman fed the cat. He did not know that he was late. The cat is called Peter."
    y: the same text with co-referring noun phrases grouped, e.g. {the policeman, he, he} and {the cat, the cat, Peter}

Examples of Complex Output Spaces
• Scene Recognition
  – Given a 3D point cloud with RGB from a Kinect camera
  – Segment it into volumes
  – Geometric dependencies between segments (e.g. a monitor is usually close to a keyboard)
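
To connect the multi-label example back to h(x) = argmax_{y∈Y} w · φ(x,y): a minimal joint feature map stacks one copy of the document's bag-of-words per label (this construction is my illustration, not from the slides; capturing the iraq/oil correlation would additionally need pairwise label features):

```python
def multilabel_phi(bow, y):
    """Joint feature map for multi-label output y: a dict label -> +1/-1.
    One bag-of-words block per label, signed by that label's value."""
    phi = {}
    for label, sign in y.items():
        for word, count in bow.items():
            phi[(label, word)] = sign * count
    return phi

doc = {"oil": 2, "baghdad": 1, "opec": 1}   # toy bag-of-words document
y = {"iraq": +1, "oil": +1, "coal": -1}     # one candidate labeling
phi = multilabel_phi(doc, y)                # score it as w . phi
```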
Wrap-Up

Batch Learning for Classification
• Discriminative
  – Decision Trees
  – Perceptron
  – Linear SVMs
  – Kernel SVMs
• Generative
  – Multinomial Naïve Bayes
  – Multivariate Naïve Bayes
  – Less Naïve Bayes
  – Linear Discriminant
  – Nearest Neighbor
• Other Methods
  – Logical rule learning
  – Stochastic Gradient
  – Logistic Regression
  – Neural Networks
  – RBF Networks
  – Boosting
  – Bagging
  – Parametric (Graphical) Models
  – Non-Parametric Models
  – *-Regression
  – *-Multiclass
→ Methods + Theory + Practice

Online Learning
• Expert Setting
  – Halving Algorithm
  – Weighted Majority
  – Exponentiated Gradient
• Bandit Setting
  – EXP3
• Other Methods
  – Hedge
  – Follow the Leader
  – UCB
  – Zooming
  – Partial Monitoring
• Other Settings
  – Dueling Bandits
  – Coactive Learning
→ CS4786 Machine Learning for Data Science
→ CS6783 Machine Learning Theory

Unsupervised Learning
• Clustering
  – Hierarchical Agglomerative Clustering
  – K-Means
  – Mixture of Gaussians and EM-Algorithm
• Other Methods
  – Spectral Clustering
  – Latent Dirichlet Allocation
  – Latent Semantic Analysis
  – Multi-Dimensional Scaling
• Other Tasks
  – Outlier Detection
  – Novelty Detection
  – Dimensionality Reduction
  – Non-Linear Manifold Detection
→ CS4786 Machine Learning for Data Science
→ CS4850 Math Found for the Information Age

Structured Prediction
• Discriminative
  – Structural SVMs
• Generative
  – Hidden Markov Model
• Other Methods
  – Maximum Margin Markov Networks
  – Conditional Random Fields
  – Markov Random Fields
  – Bayesian Networks
  – Statistical Relational Learning
→ CS4786 Machine Learning for Data Science
→ CS4782 Probabilistic Graphical Models
Other Learning Problems and
Applications
โ€ข Recommender Systems, Search Ranking, etc.
โ€“ CS4300 Information Retrieval
โ€ข Reinforcement Learning and Markov Decision
Processes
โ€“ CS4758 Robot Learning
โ€ข Computer Vision
โ€“ CS4670 Intro Computer Vision
โ€ข Natural Language Processing
โ€“ CS4740 Intro Natural Language Processing
Other Machine Learning Courses at Cornell
• INFO 3300 - Data-Driven Web Pages
• CS 4700 - Introduction to Artificial Intelligence
• CS 4780/5780 - Machine Learning (for Intelligent Systems)
• CS 4786/5786 - Machine Learning for Data Science
• CS 4758 - Robot Learning
• CS 4782 - Probabilistic Graphical Models
• OR 4740 - Statistical Data Mining
• CS 6780 - Advanced Machine Learning
• CS 6783 - Machine Learning Theory
• CS 6784 - Advanced Topics in Machine Learning
• CS 6756 - Advanced Topics in Robot Learning
• ORIE 6740 - Statistical Learning Theory for Data Mining
• ORIE 6750 - Optimal Learning
• ORIE 6780 - Bayesian Statistics and Data Analysis
• MATH 7740 - Statistical Learning Theory