Survey
Yilin Wang
11/5/2009
Background
• Labeling Problem
  • Labeling: Observed data set (X), label set (L)
  • Inferring the labels of the data points
• Most vision problems can be posed as labeling problems
  • Stereo matching
  • Image segmentation
  • Image restoration
Examples of Labeling Problem
• Stereo Matching
  • For a pixel in image 1, where is the corresponding pixel in image 2?
Label set: Differences (disparities) between corresponding pixels
Picture source: S. Lazebnik
Examples of Labeling Problem
• Image Segmentation
  • To partition an image into multiple disjoint regions.
Label set: Region IDs
Picture source: http://mmlab.ie.cuhk.edu.hk
Examples of Labeling Problem
• Image Restoration
  • To "compensate for" or "undo" defects which degrade an image.
Label set: Restored Intensities
Picture source: http://www.photorestoration.co.nz
Background
• Image Labeling
  • Given an image, the system should automatically partition it into semantically meaningful areas, each labeled with a specific object class
Example labels (figure): Cow, Sky, Lawn, Tree, Building, Plane
Image Labeling Problem
• Given
  • $X = \{x_i\}_{i \in S}$: the observed data from an input image, where $x_i$ is the data from the $i$-th site (pixel or block) of the image set $S$
  • A pre-defined label set
• Let $L = \{l_i\}_{i \in S}$ be the corresponding labels at the image sites. We want to find the labeling $L^*$ that maximizes the conditional probability $P(L \mid X)$:
$L^* = \arg\max_L P(L \mid X)$
Which kinds of information can be used for labeling?
•Features from individual sites
Intensity, color, texture, …
•Interactions with neighboring sites
Contextual information
Sky or Building?
Vegetation
Two types of interactions
•Interaction with neighboring labels (spatial smoothness of labels)
•Neighboring sites tend to have similar labels (except at the discontinuities)
•Interactions with neighboring observed data
Figure labels: Sky, Building
Information for Image Labeling
• Let $l_i$ be the label of the $i$-th site of the image set $S$, and $N_i$ be the neighboring sites of site $i$
Figure labels: site $i$, $S - \{i\}$, $N_i$
• Three kinds of information for image labeling
  • Features from local site: $\mathrm{Info}(l_i)$
  • Interaction with neighboring labels: $\mathrm{Info}(l_i, l_{N_i})$
  • Interaction with neighboring observed data: $\mathrm{Info}(l_i, x_{N_i})$
Picture source: S. Xiang
Markov Random Fields (MRF)
• Markov Random Fields (MRFs) are the most popular models to incorporate local contextual constraints in labeling problems
• Let $l_i$ be the label of the $i$-th site of the image set $S$, and $N_i$ be the neighboring sites of site $i$
• The label set $L$ ($= \{l_i\}_{i \in S}$) is said to be an MRF on $S$ w.r.t. a neighborhood $N$ iff the following condition is satisfied:
  • Markov property: $P(l_i \mid l_{S-\{i\}}) = P(l_i \mid l_{N_i})$
Maintain global spatial consistency by only considering relatively local dependencies!
Markov-Gibbs Equivalence
• Let $l$ be a realization of $L$ ($= \{l_i\}_{i \in S}$), then $P(l)$ has an explicit formulation (Gibbs distribution):
$P(l) = \frac{1}{Z} \exp\left(-\frac{1}{T} E(l)\right)$
where
$Z = \sum_{l \in \mathcal{L}} \exp\left(-\frac{1}{T} E(l)\right)$ ($\mathcal{L}$: the set of all configurations)
Energy function:
$E(l) = \sum_{c \in C} V_c(l) = \sum_{\{i\} \in C_1} V_1(l_i) + \sum_{\{i,i'\} \in C_2} V_2(l_i, l_{i'}) + \dots$
Z: a normalizing factor, called the partition function
T: a constant
Clique: $C_k = \{\{i, i', i'', \dots\} \mid i, i', i'', \dots \text{ are neighbors to one another}\}$
Potential functions represent a priori knowledge of interactions between labels of neighboring sites
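The Gibbs form can be checked by brute force on a toy problem. The sketch below is only an illustration: the three-site chain, the binary label set, and the potentials `v1` and `v2` are hypothetical choices, not taken from the slides. It enumerates every configuration, computes the energy from single-site and pair-site clique potentials, and normalizes by Z.

```python
import itertools
import math

def gibbs_distribution(n_sites, labels, v1, v2, pairs, T=1.0):
    """Enumerate every configuration l and return (P, Z), where
    P[l] = exp(-E(l) / T) / Z and E(l) sums the single-site and
    pair-site clique potentials."""
    weights = {}
    for l in itertools.product(labels, repeat=n_sites):
        e = sum(v1(l[i]) for i in range(n_sites))       # cliques in C1
        e += sum(v2(l[i], l[j]) for i, j in pairs)      # cliques in C2
        weights[l] = math.exp(-e / T)
    Z = sum(weights.values())                           # partition function
    return {l: w / Z for l, w in weights.items()}, Z

# Hypothetical example: a chain of 3 sites with binary labels,
# no single-site preference, and a cost of 1 when neighbors disagree.
pairs = [(0, 1), (1, 2)]
P, Z = gibbs_distribution(
    n_sites=3,
    labels=(0, 1),
    v1=lambda a: 0.0,
    v2=lambda a, b: 0.0 if a == b else 1.0,
    pairs=pairs,
)
print(P[(0, 0, 0)], P[(0, 1, 0)])  # the smooth labeling is more probable
```

Enumeration visits |labels|^n_sites configurations, so it is only feasible for very small lattices.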
Auto-Model
• With clique potentials up to two sites, the energy takes the form
$E(L) = \sum_{i \in S} V_1(l_i) + \sum_{i \in S} \sum_{i' \in N_i} V_2(l_i, l_{i'})$
• When $V_1(l_i) = l_i G_i(l_i)$ and $V_2(l_i, l_{i'}) = \beta_{i,i'} l_i l_{i'}$, where $G_i(\cdot)$ are arbitrary functions (or constants) and $\beta_{i,i'}$ are constants reflecting the pairwise interaction between $i$ and $i'$, the energy is
$E(L) = \sum_{i \in S} l_i G_i(l_i) + \sum_{i \in S} \sum_{i' \in N_i} \beta_{i,i'} l_i l_{i'}$
The first sum carries $\mathrm{Info}(l_i)$ and the second carries $\mathrm{Info}(l_i, l_{N_i})$.
Such models are called auto-models (Besag 1974)
Parameter Estimation
• Given the functional form of the auto-model
$E(L) = \sum_{i \in S} \alpha\, l_i + \sum_{i \in S} \sum_{i' \in N_i} \beta\, l_i l_{i'}$
how do we specify its parameters $\theta$ ($= \{\alpha, \beta\}$)?
Maximum Likelihood Estimation
• Given a realization $l$ of an MRF, the maximum likelihood (ML) estimate maximizes the conditional probability $P(l \mid \theta)$ (the likelihood of $\theta$), that is:
$\theta^* = \arg\max_\theta P(l \mid \theta)$
• By Bayes' rule:
$P(\theta \mid l) \propto P(l \mid \theta) P(\theta)$
• The prior $P(\theta)$ is assumed to be flat when prior information is totally unavailable. In this case, the MAP estimation reduces to the ML estimation.
Maximum Likelihood Estimation
• The likelihood function is in the Gibbs form
$p(l \mid \theta) = \frac{1}{Z(\theta)} \exp(-E(l \mid \theta))$
where
$Z(\theta) = \sum_{l \in \mathcal{L}} \exp(-E(l \mid \theta))$
$E(l \mid \theta) = \sum_{i \in S} \alpha\, l_i + \sum_{i \in S} \sum_{i' \in N_i} \beta\, l_i l_{i'}$
• However, the computation of $Z(\theta)$ is intractable even for moderately sized problems because there are a combinatorial number of elements in the configuration space $\mathcal{L}$.
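Maximum Pseudo-Likelihood
• Assumption: $l_i$ and $l_{N_i}$ are independent.
$P(l \mid \theta) \approx \prod_{i \in S} P(l_i \mid l_{N_i}, \theta) = \prod_{i \in S} \frac{\exp(\alpha l_i + \sum_{i' \in N_i} \beta l_i l_{i'})}{1 + \exp(\alpha + \sum_{i' \in N_i} \beta l_{i'})}$
• Notice that the pseudo-likelihood does not involve the partition function $Z$.
• $\{\alpha, \beta\}$ can be obtained by solving
$\frac{\partial \ln P(l \mid \alpha, \beta)}{\partial \alpha} = 0, \quad \frac{\partial \ln P(l \mid \alpha, \beta)}{\partial \beta} = 0$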
Inference
• Recall that in image labeling, we want to find the $L$ that maximizes the posterior $P(L \mid X)$. By Bayes' rule:
$P(L \mid X) \propto P(L, X) = P(X \mid L) P(L)$
where the prior probability is $P(L) = Z^{-1} \exp(-E(L))$
• Let $P(X \mid L) = Z_1^{-1} \exp(-E(X \mid L))$ and $P(L \mid X) \propto \exp(-E(L \mid X)) \propto P(L, X)$, then:
$E(L \mid X) = E(X \mid L) + E(L)$
(posterior energy = likelihood energy + prior energy)
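MAP-MRF Labeling
• Maximizing a posterior probability is equivalent to minimizing the posterior energy:
$L^* = \arg\min_L E(L \mid X)$
• Steps of MAP (figure):
  • Neighborhood system $N$ and cliques $C$
  • Prior energy: $E(L) = \sum_{i \in S} \alpha\, l_i + \sum_{i \in S} \sum_{i' \in N_i} \beta\, l_i l_{i'}$
  • Likelihood energy: $E(X \mid L)$
  • Posterior energy: $E(L \mid X) = E(X \mid L) + E(L)$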
Picture source: S. Xiang
MRF for Image Labeling
• Difficulties and disadvantages
  • Very strict independence assumptions:
$P(l \mid \theta) \approx \prod_{i \in S} P(l_i \mid l_{N_i}, \theta)$
  • The interactions among labels are modeled by the prior term ($P(L)$) and are independent of the observed data, which prohibits modeling data-dependent interactions between labels.
Conditional Random Fields
• Let $G = (S, E)$ be a graph, then $(X, L)$ is said to be a Conditional Random Field (CRF) if, when conditioned on $X$, the random variables $l_i$ obey the Markov property with respect to the graph:
$P(l_i \mid X, l_{S-\{i\}}) = P(l_i \mid X, l_{N_i})$
Compare with MRF:
$P(l_i \mid l_{S-\{i\}}) = P(l_i \mid l_{N_i})$
where $S - \{i\}$ is the set of all sites in the graph except the site $i$, and $N_i$ is the set of neighbors of the site $i$ in $G$.
CRF
• According to the Markov-Gibbs equivalence, we have
$P(L \mid X) = \frac{1}{Z} \exp\left(-\frac{1}{T} E(L \mid X)\right)$
• If only up to pairwise clique potentials are nonzero, the posterior probability $P(L \mid X)$ has the form
$P(L \mid X) = \frac{1}{Z} \exp\{-\sum_{i \in S} V_1(l_i \mid X) - \sum_{i \in S} \sum_{i' \in N_i} V_2(l_i, l_{i'} \mid X)\}$
where $-V_1$ and $-V_2$ are called the association and interaction potentials, respectively, in the CRF literature
CRF vs. MRF
• MRF is a generative model (two-step)
  • Infer the likelihood $P(X \mid L)$ and the prior $P(L)$
  • Then use Bayes' theorem to determine the posterior $P(L \mid X)$
• CRF is a discriminative model (one-step)
  • Directly infer the posterior $P(L \mid X)$
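CRF vs. MRF
• More differences between the CRF and MRF
  • MRF: $P(L) = \frac{1}{Z} \exp\{-\sum_{i \in S} V_1(l_i) - \sum_{i \in S} \sum_{i' \in N_i} V_2(l_i, l_{i'})\}$
    The unary term carries $\mathrm{Info}(l_i)$; the pairwise term carries $\mathrm{Info}(l_i, l_{N_i})$.
  • CRF: $P(L \mid X) = \frac{1}{Z} \exp\{-\sum_{i \in S} V_1(l_i \mid X) - \sum_{i \in S} \sum_{i' \in N_i} V_2(l_i, l_{i'} \mid X)\}$
    The unary term carries $\mathrm{Info}(l_i)$ and $\mathrm{Info}(l_i, x_{N_i})$; the pairwise term carries $\mathrm{Info}(l_i, l_{N_i})$ and $\mathrm{Info}(l_i, x_{N_i})$.
• In a CRF, both the association and interaction potentials are functions of all the observed data as well as of the labels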
Discriminative Random Fields
• The Discriminative Random Field (DRF) is a special type of CRF with two extensions.
  • First, a DRF is defined over 2D lattices (such as the image grid)
  • Second, the unary (association) and pairwise (interaction) potentials therein are designed using local discriminative classifiers
Kumar, S. and M. Hebert: `Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification'. ICCV 2003
DRF
• Formulation of DRF
$P(L \mid X) = \frac{1}{Z} \exp\{\sum_{i \in S} A_i(l_i, X) + \sum_{i \in S} \sum_{i' \in N_i} I_{ii'}(l_i, l_{i'}, X)\}$
where $A_i$ and $I_{ii'}$ are called the association potential and the interaction potential
Picture source: S. Xiang
Association Potential
• $A(l_i, X)$ is modeled using a local discriminative model that outputs the association of site $i$ with class $l_i$:
$A(l_i, X) = \log P'(l_i \mid f_i(X))$
where $f_i(\cdot)$ is a linear function that maps a patch centered at site $i$ to a feature vector.
Picture source: S. Srihari
Association Potential
• For binary classification ($l_i = -1$ or $1$), the posterior at site $i$ is modeled using a logistic function:
$P'(l_i = 1 \mid f_i(X)) = \frac{1}{1 + \exp(-(w^T f_i(X)))} = \sigma(w^T f_i(X))$
• Since $l_i = -1$ or $1$, the probability can be compactly expressed as:
$P(l_i \mid X) = \sigma(l_i w^T f_i(X))$
• Finally, the association potential is defined as:
$A(l_i, X) = \log(\sigma(l_i w^T f_i(X)))$
Picture source: S. Srihari
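As a small illustration of the association potential, the sketch below evaluates $\log \sigma(l_i w^T f_i(X))$ for both labels at one site. The weight vector `w` and the feature vector `f_i` are hypothetical numbers chosen for the example, not values from the slides.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def association_potential(l_i, w, f_i):
    """A(l_i, X) = log sigma(l_i * w^T f_i(X)) for l_i in {-1, +1}."""
    score = sum(wk * fk for wk, fk in zip(w, f_i))
    return math.log(sigmoid(l_i * score))

# Hypothetical parameters and site features.
w = [0.5, -1.0, 2.0]
f_i = [1.0, 0.2, 0.7]

a_pos = association_potential(+1, w, f_i)
a_neg = association_potential(-1, w, f_i)
print(a_pos, a_neg)
```

Because $\sigma(z) + \sigma(-z) = 1$, exponentiating the two potentials gives a proper probability over the two labels.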
Interaction Potential
• The interaction potential can be seen as a measure of how the labels at neighboring sites $i$ and $i'$ should interact given the observed image $X$.
• Given the features at two different sites, a pairwise discriminative model is defined as:
$P''(t_{ii'} \mid \psi_i(X), \psi_{i'}(X)) = \sigma(t_{ii'} v^T \mu_{ii'}(X))$
$t_{ii'} = +1$ if $l_i = l_{i'}$, and $-1$ otherwise
where $\psi_k(X)$ is a function that maps a patch centered at site $k$ to a feature vector, $\mu_{ii'}(\psi_i(X), \psi_{i'}(X))$ is a new feature vector, and $v$ are model parameters
$P''(t_{ii'} \mid \psi_i(X), \psi_{i'}(X))$ is a measure of how likely sites $i$ and $i'$ are to have the same label given image $X$
Interaction Potential
• The interaction potential is modeled using a data-dependent term along with a constant smoothing term
$I(l_i, l_{i'}, X) = \beta\{K l_i l_{i'} + (1 - K)(2\sigma(t_{ii'} v^T \mu_{ii'}(X)) - 1)\}$
• The first term is a data-independent smoothing term, similar to the auto-model
• The second term is a $[-1, 1]$ mapping of the pairwise logistic function, which ensures that both terms have the same range
• Ideally, the data-dependent term will act as a discontinuity-adaptive model that will moderate smoothing when the data from two sites is 'different'.
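The two terms of the interaction potential can be compared numerically. This is a sketch only: the parameter vector `v`, the pairwise feature vector `mu`, and the values of `beta` and `K` are hypothetical, and labels are assumed to lie in {-1, +1}.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def interaction_potential(l_i, l_j, v, mu_ij, beta, K):
    """I(l_i, l_i', X) = beta * {K l_i l_i' + (1 - K)(2 sigma(t v^T mu) - 1)}."""
    t = 1 if l_i == l_j else -1
    score = sum(vk * mk for vk, mk in zip(v, mu_ij))
    data_term = 2.0 * sigmoid(t * score) - 1.0      # mapped into [-1, 1]
    return beta * (K * l_i * l_j + (1.0 - K) * data_term)

# Hypothetical pairwise features for which v^T mu < 0, i.e. the data
# at the two sites look 'different'.
v = [1.0, -2.0]
mu = [0.5, 1.5]

same_smooth = interaction_potential(1, 1, v, mu, beta=1.0, K=1.0)   # smoothing only
same_data = interaction_potential(1, 1, v, mu, beta=1.0, K=0.0)     # data term only
diff_data = interaction_potential(1, -1, v, mu, beta=1.0, K=0.0)
print(same_smooth, same_data, diff_data)
```

With K = 1 the potential always rewards equal labels, while with K = 0 and dissimilar data it rewards a label change, which is the discontinuity-adaptive behavior described above.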
Discussion of $I(l_i, l_{i'}, X)$
$I(l_i, l_{i'}, X) = \beta\{K l_i l_{i'} + (1 - K)(2\sigma(t_{ii'} v^T \mu_{ii'}(X)) - 1)\}$
Suppose $l_{i'} = a$, $\forall i' \in N_i$, and $l_i \in \{a, b\}$
Also for simplicity, we assume $A(l_i = a) = A(l_i = b)$
Then
$l_i = a$ if $I(a, a, X) > I(a, b, X)$, and $l_i = b$ if $I(a, a, X) < I(a, b, X)$
• If only considering $K l_i l_{i'}$, $l_i$ will never choose $b$. Oversmoothed!
• The second term is used to compensate for the effect of the smoothness assumption.
Parameter Estimation
$\theta = \{w, v, \beta, K\}$
• Maximum likelihood estimation
$\theta^* = \arg\max_\theta P(l \mid X, \theta)$
• In the conventional maximum-likelihood approach, the evaluation of $Z$ is an NP-hard problem.
• Approximate evaluation of the partition function $Z$ by pseudo-likelihood
$\theta^* \approx \arg\max_\theta \prod_{m=1}^{M} \prod_{i \in S} P(l_i^m \mid l_{N_i}^m, X^m, \theta)$, subject to $0 \le K \le 1$
where $m$ indexes over the training images and $M$ is the total number of training images, and
$P(l_i \mid l_{N_i}, X, \theta) = \frac{1}{z_i} \exp\{A(l_i, X) + \sum_{i' \in N_i} I(l_i, l_{i'}, X)\}$
$z_i = \sum_{l_i \in \{-1, 1\}} \exp\{A(l_i, X) + \sum_{i' \in N_i} I(l_i, l_{i'}, X)\}$
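The local conditional used in the pseudo-likelihood only needs a two-term normalizer $z_i$ per site. The sketch below assumes the dependence on $X$ and $\theta$ has already been folded into two callables, `A` and `I`; both are hypothetical stand-ins, not the learned DRF potentials.

```python
import math

def local_conditional(l_i, neighbor_labels, A, I):
    """P(l_i | l_Ni, X, theta) = exp{A(l_i) + sum_i' I(l_i, l_i')} / z_i,
    where z_i sums the same exponential over l_i in {-1, +1}."""
    def score(l):
        return A(l) + sum(I(l, lj) for lj in neighbor_labels)
    z_i = math.exp(score(-1)) + math.exp(score(+1))
    return math.exp(score(l_i)) / z_i

def log_pseudo_likelihood(sites, A, I):
    """sites: iterable of (l_i, neighbor_labels) pairs, one per site and image."""
    return sum(math.log(local_conditional(l, nb, A, I)) for l, nb in sites)

# Hypothetical potentials: a mild preference for +1 and a smoothing interaction.
A = lambda l: 0.3 * l
I = lambda l, lj: 0.5 * l * lj

p_pos = local_conditional(+1, [1, 1, -1], A, I)
p_neg = local_conditional(-1, [1, 1, -1], A, I)
print(p_pos, p_neg)
```

Training would maximize `log_pseudo_likelihood` over the parameters hidden inside `A` and `I`, subject to the constraint on K.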
Inference
• Objective function:
$l^* = \arg\max_l P(l \mid X)$
• Iterated Conditional Modes (ICM) algorithm
  • Given an initial label configuration, ICM maximizes the local conditional probabilities iteratively, i.e.
$l_i \leftarrow \arg\max_{l_i} P(l_i \mid l_{N_i}, X)$
  • ICM yields a local maximum of the posterior and has been shown to give reasonably good results
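A minimal ICM sketch on a small grid is given below. The `local_score` function plays the role of the log of the local conditional up to a constant; the per-site evidence in `unary` and the smoothing weight `beta` are hypothetical and much simpler than the DRF potentials.

```python
def icm(labels, local_score, label_set=(-1, 1), max_sweeps=20):
    """Iterated Conditional Modes: repeatedly set each site to the label
    that maximizes its local score given the current neighbors."""
    labels = [row[:] for row in labels]          # work on a copy
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(labels)):
            for j in range(len(labels[0])):
                best = max(label_set, key=lambda l: local_score(i, j, l, labels))
                if best != labels[i][j]:
                    labels[i][j] = best
                    changed = True
        if not changed:                          # reached a local maximum
            break
    return labels

# Hypothetical per-site evidence for label +1; the center site is weakly negative.
unary = [[1.0, 1.0, -1.0],
         [1.0, -0.2, -1.0],
         [1.0, 1.0, -1.0]]
beta = 0.5

def local_score(i, j, l, labels):
    """Unary evidence plus a smoothing term over the 4-connected neighbors."""
    s = l * unary[i][j]
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < 3 and 0 <= nj < 3:
            s += beta * l * labels[ni][nj]
    return s

init = [[1 if u > 0 else -1 for u in row] for row in unary]
result = icm(init, local_score)
print(result)
```

Starting from the per-site decision, the weakly negative center site is flipped to agree with its neighbors, and the sweep then stops because no label changes.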
Experiment
• Task: detecting man-made structures in natural scenes
• Database
  • Corel (training: 108 images, test: 129 images)
  • Each image was divided into non-overlapping blocks of 16×16 pixels
• Compared methods
  • Logistic
  • MRF
  • DRF
Experiment Results
• Detection Rates (DR) and False Positives (FP)
The DRF reduces false positives from the MRF by more than 48%.
Superscript ‘-’ indicates no neighborhood data interaction was used. K = 0
indicates the absence of the data independent term in the interaction
potential in DRF.
Experiment Results
For a similar detection rate, the DRF has fewer false positives
The detection rate of the DRF is higher than that of the MRF for similar false positives
Conclusion of DRF
• Pros
  • Provide the benefits of discriminative models
  • Demonstrate good performance
• Cons
  • Although the model outperforms traditional MRFs, it is not strong enough to capture long-range correlations among the labels due to the rigid lattice-based structure, which allows for only pairwise interactions
Problem
• Local information can be confused when there are large overlaps between different classes
Sky or Water?
Solution: utilizing the global contextual information to improve the performance
Multiscale Conditional Random Field (mCRF)
• Considering features in different scales
  • Local Features (site)
  • Regional Label Features (small patch)
  • Global Label Features (big patch or the whole image)
• The conditional probability $P(L \mid X)$ is formulated by features in different scales $s$:
$P(L \mid X) = \frac{1}{Z} \prod_s P_s(L \mid X)$
where
$Z = \sum_L \prod_s P_s(L \mid X)$
He, X., R. Zemel, and M. Carreira-Perpinan: 2004, `Multiscale conditional random
fields for image labelling'. IEEE Int. Conf. CVPR.
Local Features
• The local feature of site $i$ is represented by the outputs of several filters.
• The aim is to associate the patch with one of a predefined set of labels.
Local Classifier
• Here a multilayer perceptron is used as the local classifier.
• Independently at each site $i$, the local classifier produces a conditional distribution over the label variable $l_i$ given the filter outputs $x_i$ within an image patch centered on site (pixel) $i$:
$P_C(l_i \mid x_i, \lambda)$
where $\lambda$ are the classifier parameters.
Regional Label Features
• Encoding a particular constraint between the image and the labels within a region of the image
• Sample pattern: ground pixels (brown) above water pixels (cyan)
Global Label Features
• Operate at a coarser resolution, specifying a common value for a patch of sites in the label field.
• Sample pattern: sky pixels (blue) at the top of the image, hippo pixels (red) in the middle, and water pixels (cyan) near the bottom.
Feature Function
• Global label features are trained by Restricted Boltzmann Machines (RBM)
  • two layers: label sites (L) and features (f)
  • features and labels are fully inter-connected, with no intra-layer connections
Figure: hidden feature nodes $f_1, f_2, \dots, f_m$ fully connected to label nodes $l_1, l_2, \dots, l_n$
The joint distribution of the global label feature model is:
$P_G(L, f) \propto \exp\{\sum_a f_a w_a^T L\}$
where $w_a$ is the parameter connecting hidden global label feature $f_a$ and label sites $L$
Feature Function
• By marginalizing out the hidden variables (f), the global component of the model is:
$P_G(L) \propto \prod_a (1 + \exp(w_a^T L))$
• Similarly, the regional component of the model can be represented as:
$P_R(L) \propto \prod_{r,b} (1 + \exp(u_b^T l_r))$
• By multiplicatively combining component conditional distributions:
$P(L \mid X, \theta) = \frac{1}{Z} \prod_i P_C(l_i \mid x_i, \lambda) \times \prod_{r,b} (1 + \exp(u_b^T l_r)) \times \prod_a (1 + \exp(w_a^T L))$
$\theta = \{\lambda, \{u_b\}, \{w_a\}\}$
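The marginalization step can be verified numerically. The sketch below assumes binary hidden features $f_a \in \{0, 1\}$, which is what the product form $(1 + \exp(\cdot))$ implies; the weights `W` and the three-site label vectors are hypothetical.

```python
import itertools
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def joint_unnormalized(L, f, W):
    """exp{ sum_a f_a * w_a^T L } for one setting of the hidden features f."""
    return math.exp(sum(fa * dot(wa, L) for fa, wa in zip(f, W)))

def marginal_by_enumeration(L, W):
    """Sum the unnormalized joint over every binary hidden vector f."""
    return sum(joint_unnormalized(L, f, W)
               for f in itertools.product((0, 1), repeat=len(W)))

def marginal_closed_form(L, W):
    """prod_a (1 + exp(w_a^T L)), the form used for the global component."""
    return math.prod(1.0 + math.exp(dot(wa, L)) for wa in W)

# Hypothetical weights: two hidden features over three label sites.
W = [[0.5, -1.0, 0.25],
     [1.0, 0.0, -0.5]]

for L in itertools.product((0, 1), repeat=3):
    print(L, marginal_by_enumeration(L, W), marginal_closed_form(L, W))
```

The two columns agree because the sum over each $f_a$ factorizes: every hidden unit contributes either $1$ (when $f_a = 0$) or $\exp(w_a^T L)$ (when $f_a = 1$).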
Parameter Estimation and Inference
• Parameter Estimation
  • The conditional model is trained discriminatively based on the Conditional Maximum Likelihood (CML) criterion, which maximizes the log conditional likelihood:
$\theta^* = \arg\max_\theta \sum_t \log P(L^t \mid X^t; \theta)$
• Inference
  • Maximum Posterior Marginals (MPM):
$l_i^* = \arg\max_{l_i} P(l_i \mid X), \quad \forall i \in S$
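MPM picks each label from its own marginal, so it can differ from the joint MAP labeling. The sketch below uses a hypothetical two-site posterior table to show this; in practice the marginals would have to be approximated rather than read off an explicit table.

```python
def mpm_labels(posterior, n_sites, label_set):
    """posterior maps full label tuples L to P(L | X).
    Returns, for each site i, the label maximizing the marginal P(l_i | X)."""
    result = []
    for i in range(n_sites):
        marginal = {a: 0.0 for a in label_set}
        for L, p in posterior.items():
            marginal[L[i]] += p                  # sum out all other sites
        result.append(max(label_set, key=lambda a: marginal[a]))
    return result

def map_labels(posterior):
    """The single most probable joint labeling."""
    return max(posterior, key=posterior.get)

# Hypothetical posterior over two binary sites.
posterior = {(0, 0): 0.4, (1, 0): 0.3, (1, 1): 0.3, (0, 1): 0.0}

print(mpm_labels(posterior, 2, (0, 1)))   # per-site marginal maximizers
print(map_labels(posterior))              # joint maximizer
```

Here site 0 has marginal 0.6 for label 1 even though the most probable joint labeling assigns it label 0.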
Experiment Results
• Database
  • Corel (100 images with 7 labels)
  • Sowerby (104 images with 8 labels)
• Compared methods
  • Single classifier (MLP)
  • MRF
  • mCRF
Labeling Results
Conclusion of mCRF
• Pros
  • Formulating the image labeling problem into a multiscale CRF model
  • Combining the local and larger-scale contextual information in a single framework
• Cons
  • Including additional classifiers operating at different scales into the mCRF framework introduces a large number of model parameters
  • The model assumes conditional independence of hidden variables given the label field
More CRF models
• Hierarchical Conditional Random Field (HCRF)
  • S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. 2005
  • Jordan Reynolds and Kevin Murphy. Figure-ground segmentation using a hierarchical conditional random field. 2007
• Tree Structured Conditional Random Fields (TCRF)
  • P. Awasthi, A. Gagrani, and B. Ravindran. Image Modeling using Tree Structured Conditional Random Fields. 2007
Reference
• Li, S. Z.: 2009, `Markov Random Field Modeling in Image Analysis'. Springer.
• Kumar, S. and M. Hebert: 2003, `Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification'. In Proc. IEEE International Conference on Computer Vision (ICCV).
• He, X., R. Zemel, and M. Carreira-Perpinan: 2004, `Multiscale conditional random fields for image labelling'. IEEE Int. Conf. CVPR.
End
Thanks!