Yilin Wang, 11/5/2009

Background: The Labeling Problem
• Labeling: given an observed data set X and a label set L, infer the labels of the data points.
• Most vision problems can be posed as labeling problems: stereo matching, image segmentation, image restoration.

Examples of Labeling Problems
• Stereo matching: for a pixel in image 1, where is the corresponding pixel in image 2? Label set: differences (disparities) between corresponding pixels. (Picture source: S. Lazebnik)
• Image segmentation: partition an image into multiple disjoint regions. Label set: region IDs. (Picture source: http://mmlab.ie.cuhk.edu.hk)
• Image restoration: "compensate for" or "undo" defects which degrade an image. Label set: restored intensities. (Picture source: http://www.photorestoration.co.nz)

Background: Image Labeling
• Given an image, the system should automatically partition it into semantically meaningful areas, each labeled with a specific object class (e.g., Cow, Sky, Lawn, Tree, Building, Plane).

Image Labeling Problem
• Given X = {x_i}_{i∈S}, the observed data from an input image, where x_i is the data at the i-th site (pixel or block) of the site set S, and a pre-defined label set.
• Let L = {l_i}_{i∈S} be the corresponding labels at the image sites. We want to find the L that maximizes the conditional probability P(L | X):
  L* = argmax_L P(L | X)

Which kinds of information can be used for labeling?
• Features from individual sites: intensity, color, texture, …
• Interactions with neighboring sites: contextual information (a locally ambiguous patch: Sky or Building? Vegetation?).

Two types of interactions
• Interaction with neighboring labels (spatial smoothness of labels): neighboring sites tend to have similar labels, except at the discontinuities.
• Interactions with neighboring observed data.

Information for Image Labeling
• Let l_i be the label of the i-th site of the site set S, and N_i be the neighboring sites of site i.
• Three kinds of information for image labeling:
  – Info(l_i): features from the local site
  – Info(l_i, l_{N_i}): interaction with neighboring labels
  – Info(l_i, x_{N_i}): interaction with neighboring observed data
(Picture source: S. Xiang)

Markov Random Fields (MRF)
• Markov Random Fields (MRFs) are the most popular models for incorporating local contextual constraints in labeling problems.
• The label set L = {l_i}_{i∈S} is said to be an MRF on S w.r.t. a neighborhood system N iff the Markov property is satisfied:
  P(l_i | l_{S\{i}}) = P(l_i | l_{N_i})
• Maintain global spatial consistency by only considering relatively local dependencies!

Markov-Gibbs Equivalence
• Let l be a realization of L = {l_i}_{i∈S}. Then P(l) has an explicit formulation, the Gibbs distribution:
  P(l) = (1/Z) exp(-(1/T) E(l)),   Z = Σ_{l∈L} exp(-(1/T) E(l))
  where Z is a normalizing factor called the partition function, and T is a constant.
• Energy function: E(l) = Σ_{c∈C} V_c(l)
• A clique is a set of mutually neighboring sites, C_k = {{i, i', i'', …} | i, i', i'', … are neighbors to one another}, so the energy expands as
  E(l) = Σ_{{i}∈C_1} V_1(l_i) + Σ_{{i,i'}∈C_2} V_2(l_i, l_{i'}) + …
• The potential functions represent a priori knowledge of the interactions between labels of neighboring sites.
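To make the Gibbs formulation above concrete, here is a minimal Python sketch (not part of the original slides) that evaluates E(l) with single-site and pairwise clique potentials on a 4-connected grid and normalizes it with a brute-force partition function. The particular potentials V1 and V2, the grid size, and the temperature are illustrative placeholders.

```python
import itertools
import numpy as np

T = 1.0  # temperature constant

def V1(l_i):
    """Single-site clique potential (placeholder: mild preference for label 0)."""
    return 0.2 * l_i

def V2(l_i, l_j):
    """Pairwise clique potential (placeholder: penalize disagreeing neighbor labels)."""
    return 0.0 if l_i == l_j else 1.0

def energy(l):
    """E(l): sum of V1 over single-site cliques plus sum of V2 over the
    horizontal and vertical pairwise cliques of a 4-connected grid."""
    e = sum(V1(v) for v in l.flat)
    e += sum(V2(l[r, c], l[r, c + 1])
             for r in range(l.shape[0]) for c in range(l.shape[1] - 1))
    e += sum(V2(l[r, c], l[r + 1, c])
             for r in range(l.shape[0] - 1) for c in range(l.shape[1]))
    return e

def gibbs_probability(l, shape=(2, 3), label_values=(0, 1)):
    """P(l) = exp(-E(l)/T) / Z, with Z computed by enumerating every configuration."""
    Z = sum(np.exp(-energy(np.array(cfg).reshape(shape)) / T)
            for cfg in itertools.product(label_values, repeat=shape[0] * shape[1]))
    return np.exp(-energy(l) / T) / Z

l = np.array([[0, 0, 1],
              [0, 1, 1]])
print(gibbs_probability(l))
```

Enumerating the configuration space is feasible here only because the toy grid has six sites; the same loop over all configurations is exactly what makes Z intractable for real images.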
Auto-Model
• With clique potentials up to two sites, the energy takes the form
  E(L) = Σ_{i∈S} V_1(l_i) + Σ_{i∈S} Σ_{i'∈N_i} V_2(l_i, l_{i'})
• When V_1(l_i) = l_i G_i(l_i) and V_2(l_i, l_{i'}) = β_{i,i'} l_i l_{i'}, where the G_i(·) are arbitrary functions (or constants) and the β_{i,i'} are constants reflecting the pairwise interaction between i and i', the energy becomes
  E(L) = Σ_{i∈S} l_i G_i(l_i) + Σ_{i∈S} Σ_{i'∈N_i} β_{i,i'} l_i l_{i'}
  (the first term carries Info(l_i), the second carries Info(l_i, l_{N_i}))
• Such models are called auto-models (Besag 1974).

Parameter Estimation
• Given the functional form of the auto-model
  E(L) = Σ_{i∈S} α l_i + Σ_{i∈S} Σ_{i'∈N_i} β l_i l_{i'}
  how do we specify its parameters θ = {α, β}?

Maximum Likelihood Estimation
• Given a realization l of an MRF, the maximum likelihood (ML) estimate maximizes the conditional probability P(l | θ), the likelihood of θ:
  θ* = argmax_θ P(l | θ)
• By Bayes' rule, P(θ | l) ∝ P(l | θ) P(θ). The prior P(θ) is assumed to be flat when prior information is totally unavailable; in this case the MAP estimate reduces to the ML estimate.
• The likelihood function is in the Gibbs form
  P(l | θ) = (1/Z(θ)) exp(-E(l | θ)),   Z(θ) = Σ_{l∈L} exp(-E(l | θ)),
  E(l | θ) = Σ_{i∈S} α l_i + Σ_{i∈S} Σ_{i'∈N_i} β l_i l_{i'}
• However, the computation of Z(θ) is intractable even for moderately sized problems, because the configuration space L contains a combinatorial number of elements.

Maximum Pseudo-Likelihood
• Assumption: treat the sites as independent given their neighborhoods, so the likelihood factorizes into local conditionals:
  P(l | θ) ≈ Π_{i∈S} P(l_i | l_{N_i}, θ) = Π_{i∈S} exp(α l_i + Σ_{i'∈N_i} β l_i l_{i'}) / (1 + exp(α + Σ_{i'∈N_i} β l_{i'}))
• Notice that the pseudo-likelihood does not involve the partition function Z.
• {α, β} can be obtained by solving
  ∂ ln P(l | α, β) / ∂α = 0,   ∂ ln P(l | α, β) / ∂β = 0
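As an illustration of why the pseudo-likelihood is attractive, the following small sketch fits {α, β} of an auto-logistic model by numerically maximizing the log pseudo-likelihood on a toy binary label image. It assumes labels in {0, 1}, a single (α, β) shared by all sites, and a 4-connected neighborhood, and it uses a generic optimizer rather than solving the two gradient equations in closed form; none of these choices come from the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Toy binary label image: a square region with a little label noise.
rng = np.random.default_rng(0)
labels = np.zeros((16, 16), dtype=int)
labels[4:12, 4:12] = 1
labels = np.where(rng.random(labels.shape) < 0.05, 1 - labels, labels)

def neighbor_sum(l):
    """Sum of 4-connected neighbor labels at every site (zero padding at the border)."""
    s = np.zeros_like(l, dtype=float)
    s[1:, :] += l[:-1, :]
    s[:-1, :] += l[1:, :]
    s[:, 1:] += l[:, :-1]
    s[:, :-1] += l[:, 1:]
    return s

def neg_log_pseudo_likelihood(theta, l):
    """-log PL(l | alpha, beta) for the auto-logistic model, where
    P(l_i = 1 | l_Ni) = sigmoid(alpha + beta * sum of neighbor labels)."""
    alpha, beta = theta
    eta = alpha + beta * neighbor_sum(l)        # local field at each site
    log_p = l * eta - np.logaddexp(0.0, eta)    # l_i * eta - log(1 + exp(eta))
    return -log_p.sum()

# Maximize the pseudo-likelihood; no partition function is ever evaluated.
res = minimize(neg_log_pseudo_likelihood, x0=np.zeros(2), args=(labels,))
alpha_hat, beta_hat = res.x
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")  # positive beta indicates label smoothness
```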
Inference
• Recall that in image labeling we want to find the L that maximizes the posterior P(L | X). By Bayes' rule:
  P(L | X) ∝ P(L, X) = P(X | L) P(L)
• With prior probability P(L) = Z^{-1} exp(-E(L)) and likelihood P(X | L) = Z_1^{-1} exp(-E(X | L)), the posterior can be written P(L | X) ∝ exp(-E(L | X)), where
  E(L | X) = E(X | L) + E(L)
  (posterior energy = likelihood energy + prior energy)

MAP-MRF Labeling
• Maximizing the posterior probability is equivalent to minimizing the posterior energy:
  L* = argmin_L E(L | X)
• Steps of MAP-MRF labeling: define the neighborhood system N and the cliques C; define the prior energy E(L) = Σ_{i∈S} α l_i + Σ_{i∈S} Σ_{i'∈N_i} β l_i l_{i'}; define the likelihood energy E(X | L); minimize the posterior energy E(L | X) = E(X | L) + E(L).
(Picture source: S. Xiang)

MRF for Image Labeling: Difficulties and Disadvantages
• Very strict independence assumptions: P(l | θ) ≈ Π_{i∈S} P(l_i | l_{N_i}, θ).
• The interactions among labels are modeled by the prior term P(L) and are independent of the observed data, which prohibits one from modeling data-dependent interactions among labels.

Conditional Random Fields
• Let G = (S, E) be a graph. Then (X, L) is said to be a Conditional Random Field (CRF) if, when conditioned on X, the random variables l_i obey the Markov property with respect to the graph:
  P(l_i | X, l_{S\{i}}) = P(l_i | X, l_{N_i})
  Compare with the MRF: P(l_i | l_{S\{i}}) = P(l_i | l_{N_i})
  where S\{i} is the set of all sites in the graph except site i, and N_i is the set of neighbors of site i in G.

CRF
• According to the Markov-Gibbs equivalence,
  P(L | X) = (1/Z) exp(-(1/T) E(L | X))
• If only up-to-pairwise clique potentials are nonzero, the posterior probability P(L | X) has the form
  P(L | X) = (1/Z) exp{-Σ_{i∈S} V_1(l_i | X) - Σ_{i∈S} Σ_{i'∈N_i} V_2(l_i, l_{i'} | X)}
  where -V_1 and -V_2 are called the association and interaction potentials, respectively, in the CRF literature.

CRF vs. MRF
• MRF is a generative model (two steps): infer the likelihood P(X | L) and the prior P(L), then use Bayes' theorem to determine the posterior P(L | X).
• CRF is a discriminative model (one step): directly infer the posterior P(L | X).
• More differences between the CRF and the MRF:
  MRF:  P(L) = (1/Z) exp(-Σ_{i∈S} V_1(l_i) - Σ_{i∈S} Σ_{i'∈N_i} V_2(l_i, l_{i'})), capturing Info(l_i) and Info(l_i, l_{N_i})
  CRF:  P(L | X) = (1/Z) exp{-Σ_{i∈S} V_1(l_i | X) - Σ_{i∈S} Σ_{i'∈N_i} V_2(l_i, l_{i'} | X)}, capturing Info(l_i), Info(l_i, x_{N_i}), and Info(l_i, l_{N_i})
• In a CRF, both the association and interaction potentials are functions of all the observation data as well as of the labels.

Discriminative Random Fields
• The Discriminative Random Field (DRF) is a special type of CRF with two extensions. First, a DRF is defined over 2D lattices (such as the image grid). Second, the unary (association) and pairwise (interaction) potentials are designed using local discriminative classifiers.
• Kumar, S. and M. Hebert: "Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification", ICCV 2003.

DRF Formulation
• P(L | X) = (1/Z) exp{Σ_{i∈S} A_i(l_i, X) + Σ_{i∈S} Σ_{i'∈N_i} I_{ii'}(l_i, l_{i'}, X)}
  where A_i and I_{ii'} are called the association potential and the interaction potential.
(Picture source: S. Xiang)

Association Potential
• A(l_i, X) is modeled using a local discriminative model that outputs the association of site i with class l_i:
  A(l_i, X) = log P'(l_i | f_i(X))
  where f_i(·) is a function that maps a patch centered at site i to a feature vector.
• For binary classification (l_i = -1 or 1), the posterior at site i is modeled with a logistic function:
  P'(l_i = 1 | f_i(X)) = σ(w^T f_i(X)) = 1 / (1 + exp(-w^T f_i(X)))
  Since l_i = -1 or 1, the probability can be compactly expressed as P'(l_i | X) = σ(l_i w^T f_i(X)).
• Finally, the association potential is defined as
  A(l_i, X) = log σ(l_i w^T f_i(X))
(Picture source: S. Srihari)

Interaction Potential
• The interaction potential can be seen as a measure of how the labels at neighboring sites i and i' should interact given the observed image X.
• Given the features at two different sites, a pairwise discriminative model is defined as
  P''(t_{ii'} | ψ_i(X), ψ_{i'}(X)) = σ(t_{ii'} v^T μ_{ii'}(X)),   t_{ii'} = 1 if l_i = l_{i'}, and -1 otherwise,
  where ψ_k(X) maps a patch centered at site k to a feature vector, μ_{ii'}(X) = (ψ_i(X), ψ_{i'}(X)) is a new feature vector, and v are the model parameters.
• P''(t_{ii'} | ψ_i(X), ψ_{i'}(X)) is a measure of how likely sites i and i' are to have the same label given the image X.
• The interaction potential is modeled using a data-dependent term along with a constant smoothing term:
  I(l_i, l_{i'}, X) = β {K l_i l_{i'} + (1 - K)(2 σ(t_{ii'} v^T μ_{ii'}(X)) - 1)}
• The first term is a data-independent smoothing term, similar to the auto-model. The second term is a [-1, 1] mapping of the pairwise logistic function, which ensures that both terms have the same range.
• Ideally, the data-dependent term acts as a discontinuity-adaptive model that moderates smoothing when the data from the two sites is "different".

Discussion of I(l_i, l_{i'}, X)
• I(l_i, l_{i'}, X) = β {K l_i l_{i'} + (1 - K)(2 σ(t_{ii'} v^T μ_{ii'}(X)) - 1)}
• Suppose l_{i'} = a for all i' ∈ N_i, and l_i ∈ {a, b}. For simplicity, also assume A(l_i = a) = A(l_i = b). Then l_i = a if I(a, a, X) > I(a, b, X), and l_i = b if I(a, a, X) < I(a, b, X).
• If only the K l_i l_{i'} term is considered, I(a, a, X) always exceeds I(a, b, X), so l_i will never choose b. Oversmoothed!
• The second term is used to compensate for this effect of the smoothness assumption.
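The following sketch shows how the DRF association and interaction potentials described above could be evaluated for binary labels in {-1, +1}. The feature vectors, parameter values, and helper names are invented for illustration; in the actual DRF, f_i and μ_{ii'} are computed from image patches and w, v, β, K are learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def association(l_i, f_i, w):
    """A(l_i, X) = log sigma(l_i * w^T f_i(X)), with l_i in {-1, +1}."""
    return np.log(sigmoid(l_i * (w @ f_i)))

def interaction(l_i, l_j, mu_ij, v, beta, K):
    """I(l_i, l_j, X) = beta * {K l_i l_j + (1-K)(2 sigma(t_ij v^T mu_ij(X)) - 1)},
    where t_ij = +1 if l_i == l_j and -1 otherwise."""
    t_ij = 1.0 if l_i == l_j else -1.0
    data_independent = K * l_i * l_j                                  # Ising-like smoothing
    data_dependent = (1.0 - K) * (2.0 * sigmoid(t_ij * (v @ mu_ij)) - 1.0)
    return beta * (data_independent + data_dependent)

# Toy usage with made-up feature vectors and parameters (purely illustrative).
f_i = np.array([0.3, -1.2, 0.7])        # unary features at site i
mu_ij = np.array([0.1, 0.4])            # pairwise features for sites (i, j)
w = np.array([1.0, -0.5, 0.2])
v = np.array([2.0, -1.0])
print(association(+1, f_i, w), interaction(+1, -1, mu_ij, v, beta=1.0, K=0.5))
```

When the data-dependent term indicates the two patches are different, it pulls the pairwise score toward -1 and counteracts the constant smoothing term, which is exactly the discontinuity-adaptive behavior discussed above.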
Parameter Estimation
• Parameters: θ = {w, v, β, K}.
• Maximum likelihood estimation: θ* = argmax_θ P(l | X, θ). In the conventional maximum-likelihood approach, the evaluation of Z is an NP-hard problem.
• Approximate the partition function Z via the pseudo-likelihood:
  θ* = argmax_θ Π_{m=1}^{M} Π_{i∈S} P(l_i^m | l_{N_i}^m, X^m, θ),   subject to 0 ≤ K ≤ 1,
  where m indexes the training images, M is the total number of training images, and
  P(l_i | l_{N_i}, X, θ) = (1/z_i) exp{A(l_i, X) + Σ_{i'∈N_i} I(l_i, l_{i'}, X)},
  z_i = Σ_{l_i∈{-1,1}} exp{A(l_i, X) + Σ_{i'∈N_i} I(l_i, l_{i'}, X)}

Inference
• Objective function: l* = argmax_l P(l | X).
• Iterated Conditional Modes (ICM) algorithm: given an initial label configuration, ICM maximizes the local conditional probabilities iteratively, i.e.
  l_i ← argmax_{l_i} P(l_i | l_{N_i}, X)
• ICM yields a local maximum of the posterior and has been shown to give reasonably good results.

Experiment
• Task: detecting man-made structures in natural scenes.
• Database: Corel (training: 108 images, test: 129 images). Each image was divided into non-overlapping 16×16-pixel blocks.
• Compared methods: Logistic, MRF, DRF.

Experiment Results
• Detection rates (DR) and false positives (FP): the DRF reduces false positives relative to the MRF by more than 48%. The superscript '-' indicates that no neighborhood data interaction was used; K = 0 indicates the absence of the data-independent term in the DRF interaction potential.
• For a similar detection rate, the DRF has fewer false positives; for similar false positives, the detection rate of the DRF is higher than that of the MRF.

Conclusion of DRF
• Pros: provides the benefits of discriminative models; demonstrates good performance.
• Cons: although the model outperforms traditional MRFs, it is not strong enough to capture long-range correlations among the labels, due to the rigid lattice-based structure that allows only pairwise interactions.

Problem
• Local information can be confused when there are large overlaps between different classes (Sky or Water?).
• Solution: utilize global contextual information to improve performance.

Multiscale Conditional Random Field (mCRF)
• Consider features at different scales:
  – Local features (site)
  – Regional label features (small patch)
  – Global label features (big patch or the whole image)
• The conditional probability P(L | X) is formulated as a product over features at different scales s:
  P(L | X) = (1/Z) Π_s P_s(L | X),   Z = Σ_L Π_s P_s(L | X)
• He, X., R. Zemel, and M. Carreira-Perpinan: "Multiscale conditional random fields for image labelling", CVPR 2004.

Local Features
• The local feature of site i is represented by the outputs of several filters. The aim is to associate the patch with one of a predefined set of labels.
• Local classifier: here a multilayer perceptron is used. Independently at each site i, the local classifier produces a conditional distribution P_C(l_i | x_i, λ) over the label variable l_i given the filter outputs x_i within an image patch centered on site (pixel) i, where λ are the classifier parameters.

Regional Label Features
• Encode a particular constraint between the image and the labels within a region of the image.
• Sample pattern: ground pixels (brown) above water pixels (cyan).

Global Label Features
• Operate at a coarser resolution, specifying a common value for a patch of sites in the label field.
• Sample pattern: sky pixels (blue) at the top of the image, hippo pixels (red) in the middle, and water pixels (cyan) near the bottom.

Feature Function
• The global label features are trained with a Restricted Boltzmann Machine (RBM): two layers, the label sites (l_1, …, l_n) and the features (f_1, …, f_m); features and labels are fully inter-connected, with no intra-layer connections.
• The joint distribution of the global label feature model is
  P_G(L, f) ∝ exp{Σ_a f_a w_a^T L}
  where w_a is the parameter vector connecting hidden global label feature f_a and the label sites L.
• By marginalizing out the hidden variables f, the global component of the model is
  P_G(L) ∝ Π_a (1 + exp(w_a^T L))
• Similarly, the regional component of the model can be represented as
  P_R(L) ∝ Π_{r,b} (1 + exp(u_b^T l_r))
• Multiplicatively combining the component conditional distributions:
  P(L | X, θ) = (1/Z) Π_{i} P_C(l_i | x_i, λ) · Π_{r,b} (1 + exp(u_b^T l_r)) · Π_a (1 + exp(w_a^T L)),   θ = {λ, {u_b}, {w_a}}

Parameter Estimation and Inference
• Parameter estimation: the conditional model is trained discriminatively using the Conditional Maximum Likelihood (CML) criterion, which maximizes the log conditional likelihood:
  θ* = argmax_θ Σ_t log P(L^t | X^t; θ)
• Inference: Maximum Posterior Marginals (MPM):
  l_i* = argmax_{l_i} P(l_i | X),   ∀ i ∈ S
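To see how the multiplicative combination works mechanically, here is a minimal sketch (with made-up numbers) that evaluates the unnormalized log-score of one candidate labeling as the sum of the local-classifier, regional, and global log-components. The label encoding, region layout, and all parameter values are invented for illustration; a real implementation would still need the global normalizer Z and the CML training loop.

```python
import numpy as np

def log_mcrf_score(local_log_probs, label_vec, regional_terms, W):
    """Unnormalized log-score of one labeling under the multiplicative combination:
       log prod_i P_C(l_i | x_i)  +  sum_{r,b} log(1 + exp(u_b^T l_r))  +  sum_a log(1 + exp(w_a^T L)).

    local_log_probs : (n_sites,) log P_C of the chosen label at each site (from the local classifier).
    label_vec       : flattened one-hot encoding L of the whole label field.
    regional_terms  : list of (U_r, l_r) pairs, U_r stacking the u_b row-wise for region r
                      and l_r the flattened one-hot labels of that region.
    W               : matrix stacking the global parameters w_a row-wise.
    """
    score = local_log_probs.sum()                              # local classifier component
    for U_r, l_r in regional_terms:                            # regional component
        score += np.logaddexp(0.0, U_r @ l_r).sum()            # sum_b log(1 + exp(u_b^T l_r))
    score += np.logaddexp(0.0, W @ label_vec).sum()            # global component
    return score

# Tiny illustrative example: 4 sites, 2 classes, one region covering sites 0-1,
# two regional features and two global features (all numbers made up).
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
label_vec = np.eye(2)[labels].ravel()                          # one-hot label field, length 8
local_log_probs = np.log(np.array([0.9, 0.8, 0.7, 0.95]))
regional_terms = [(rng.normal(size=(2, 4)), label_vec[:4])]
W = rng.normal(size=(2, 8))
print(log_mcrf_score(local_log_probs, label_vec, regional_terms, W))
```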
Experiment Results
• Databases: Corel (100 images with 7 labels) and Sowerby (104 images with 8 labels).
• Compared methods: single classifier (MLP), MRF, mCRF.

Labeling Results

Conclusion of mCRF
• Pros: formulates the image labeling problem as a multiscale CRF model; combines local and larger-scale contextual information in a unique framework.
• Cons: including additional classifiers operating at different scales in the mCRF framework introduces a large number of model parameters; the model assumes conditional independence of the hidden variables given the label field.

More CRF Models
• Hierarchical Conditional Random Field (HCRF)
  – S. Kumar and M. Hebert: "A Hierarchical Field Framework for Unified Context-Based Classification", 2005.
  – Jordan Reynolds and Kevin Murphy: "Figure-Ground Segmentation Using a Hierarchical Conditional Random Field", 2007.
• Tree-Structured Conditional Random Fields (TCRF)
  – P. Awasthi, A. Gagrani, and B. Ravindran: "Image Modeling Using Tree Structured Conditional Random Fields", 2007.

References
• Li, S. Z.: "Markov Random Field Modeling in Image Analysis". Springer, 2009.
• Kumar, S. and M. Hebert: "Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification". In Proc. IEEE International Conference on Computer Vision (ICCV), 2003.
• He, X., R. Zemel, and M. Carreira-Perpinan: "Multiscale Conditional Random Fields for Image Labelling". In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004.

End. Thanks!