* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Pattern Theory: the Mathematics of Perception
Survey
Document related concepts
Transcript
Pattern Theory: the Mathematics of Perception Prof. David Mumford Division of Applied Mathematics Brown University International Congress of Mathematics Beijing, 2002 Outline of talk I. Background: history, motivation, basic definitions II. A basic example – Hidden Markov Models and speech; and extensions III. The “natural degree of generality” – Markov Random Fields; and vision applications IV. Continuous models: image processing via PDE’s, self-similarity of images and random diffeomorphisms URL: www.dam.brown.edu/people/mumford/Papers /ICM02powerpoint.pdf or /ICM02proceedings.pdf Some History • Is there a mathematical theory underlying intelligence? • 40’s – Control theory (Wiener-Pontrjagin), the output side: driving a motor with noisy feedback in a noisy world to achieve a given state • 70’s – ARPA speech recognition program • 60’s-80’s – AI, esp. medical expert systems, modal, temporal, default and fuzzy logics and finally statistics • 80’s-90’s – Computer vision, autonomous land vehicle Statistics vs. Logic • Plato: “If Theodorus, or any other geometer, were prepared to rely on plausibility when he was doing geometry, he'd be worth absolutely nothing.” • Graunt – counting corpses in medieval London • Gauss – Gaussian distributions, least squares relocating lost Ceres from noisy incomplete data • Control theory – the Kalman-Wiener-Bucy filter • AI – Enhanced logics < Bayesian belief networks • Vision – Boolean combinations of features < Markov random fields What you perceive is not what you hear: ACTUAL SOUND 1. The ?eel is on the shoe 2. The ?eel is on the car 3. The ?eel is on the table 4. The ?eel is on the orange 1. 2. 3. 4. PERCEIVED WORDS The heel is on the shoe The wheel is on the car The meal is on the table The peel is on the orange (Warren & Warren, 1970) Statistical inference is being used! Why is this old man recognizable from a cursory glance? His outline is lost in clutter, shadows and wrinkles; except for one ear, his face is invisible. No known algorithm will find him. The Bayesian Setup, I VARIABLES: xo observable variables, xh hidden (not directly observable) variables, parameters in model MODEL: Pr( xo | xh , ).Pr( xh | ) or Pr( xh , ) ML PARAMETER ESTIMATION: arg max Pr( xˆ ( ) | ) or arg max Pr( xˆo( ) , xh | ) xh The Bayesian Setup, II BAYES’S RULE: Pr( xˆo | xh , ).Pr( xh | ) Pr( xh | xˆo , ) Pr( xˆo | ) Pr( xˆo | xh , ).Pr( xh | ) This is called the “posterior” distribution on xh •Sampling Pr(xo,xh|), “synthesis” is the acid test of the model The central problem of Statistical learning theory: • The complexity of the model and the Bias-Variance dilemma * Minimum Description Length—MDL, * Vapnik’s VC dimension A basic example: HMM’s and speech recognition I. Setup sk = sound in window around time kDt are the observables xk = part of phoneme being spoken at time kDt are the hidden vars. Pr( x , s ) p1 ( xk | xk 1 ). p2 (sk | xk ) k Sk-1 sk Sk+1 xk+1 xk-1 xk ’s = log’s of table of values of p1, p2, so 1 k k .Ek ( x , s ) Pr( x , s ) e Z A basic example: HMM’s and speech recognition II. Inference by dynamic programming: (a) p (x 1 Pr( xk | sˆ k ) k | xk 1 ). p2 ( sˆk | xk ).Pr( xk 1 | sˆ k 1 ) xk 1 numerator xk (b) Call maxp( xk ) max xk 1 Pr( xk , x k 1 , sˆ k ), then maxp( xk ) max xk 1 p1 ( xk | xk 1 ). p2 ( sˆk | xk ).maxp( xk 1 ) (c) Optimizing the ’s done by “EM” algorithm, valid for any exponential model Continuous and discrete variables in perception • Perception locks on to discrete labels, and the world is made up of discrete objects/events • Make an empirical histogram of changes x and compute the kurtosis: k =Exp(( x x )4 ) / 4 • High kurtosis is nature’s universal signal of discrete events/objects in space-time. • Stochastic process with i.i.d. increments has jumps iff the kurtosis k of its increments is > 3. A typical stochastic process with jumps Xt stochastic process with independent increments, then z X t 1 X t e z2 / 2 2 X t Brownian, hence continuous z X t 1 X t e z X t has jumps: Ex.: daily log-price changes in a sample of stocks Note fat power law tails N.B. vertical axis is log of probability Particle filtering • Compiling full conditional probability tables is usually impractical. •Use a weighted sample: Pr( xk | sˆk ) weak w i ,k x ( xk ) i ,k i • Bootstrap particle filtering: (a) Sample with replacement (b) Diffuse via p1 (c) Reweight via p2 • Tracking application (from A.Blake, M.Isard): Estimating the posterior distribution on optical flow in a movie (from M.Black) Horizontal flow (follow window in red) Horizontal flow Horizontal flow Horizontal flow Horizontal flow Horizontal flow No process is truly Markov • Speech has longer range patterns than phonemes: triphones, words, sentences, speech acts, … • PCFG’s = “probabilistic context free grammars” = almost surely finite, labeled, random branching processes: Forest of random trees Tn, labels xv on vertices, leaves in 1:1 corresp with observations sm, prob. p1(xvk|xv) on children, p2(sm|xm) on observations.). • Unfortunate fact: nature is not so obliging, longer range constraints force context-sensitive grammars. But how to make these stochastic?? Grammar in the parsed speech of Helen, a 2 ½ year old Grammar in images (G. Kanisza): contour completion Markov Random Fields: the natural degree of generality • Time linear structure of dependencies • space/space-time/abstract situations general graphical structure of dependencies G (V , E ), a graph, xv vV , the variables, EC ( xC ) 1 Pr( xV ) e C , Z C the cliques of G The Markov property: xv, xw are conditionally independent, given xS , if S separates v,w in G. Hammersley-Clifford: the converse. A simple MRF: the Ising model 1 Pr( x , s ) e ZT c. ( x s ) 2 sk+1,l-1 sk,l-1 sk-1,l-1 sk-1,l , x 1, s sk+1,l+1 sk,l+1 xk+1,l xk,l xk-1,l , sk-1,l+1 xk+1,l-1 xk-1,l-1 sk+1,l sk,l xk,l-1 x x T adj. xk+1,l+1 xk,l+1 xk-1,l+1 The Ising model and image segmentation A state-of-the-art image segmentation algorithm (S.-C. Zhu) Input Segmentation Synthesis from model I ~ p( I | W*) Hidden variables describe segments and their texture, allowing both slow and abrupt intensity and texture changes (See also Shi-Malik) Texture synthesis via MRF’s On left: a cheetah hide; In middle, a sample from the Gaussian model with identical second order statistics; On right, a sample from exponential model reproducing 7 filter marginals using: k (( Fk I )( x )) 1 Pr( I ) e k Z Monte Carlo Markov Chains Basic idea: use artificial thermal dynamics to find minimum energy (=maximum probability) states Bayesian belief propagation and the Bethe approximation • Can find modes of MRF’s on trees using dynamic programming •When the graph G is not a tree, use its universal covering graph G •The Bethe approximation = the 1(G)-invariant MRF on G ‘closest’ to the given MRF, ‘closest’ = minimizing Kullback-Liebler information distance •‘Bayesian belief propagation’ = finding the modes of the Bethe approximation with dynamic programming Continuous models I: deblurring and denoising • Observe noisy, blurred image I, seek to remove noise, enhance edges simultaneously! min c1 ( I J )2 dxdy c2 J D , D D p dxdy c3 • p=2 is Mumford-Shah model; • p=1, c3=0 is Osher-Rudin ‘TV’ model J J div c J t I J p • p=2 is Perona-Malik equation; c=0, p=1 is TV-gradient descent An example: Bela Bartok enhanced via the Nitzberg-Shiota filter Continuous models II: images and scaling • The statistics of images of ‘natural scenes’ appear to be a fixed point under blockaveraging renormalization, i.e. • Assume NN images of natural scenes have a certain probability distribution; form N/2N/2 images by a window or by 22 averages – get the same marginal distribution! Scale invariance has many implications: •Power law for the spectrum: Exp Iˆ( , ) c 2 2 2 • In the continuous limit, images are not locally integrable functions but generalized functions in: H • Intuitively, this is what we call ‘clutter’ – the mathematical explanation of why vision is hard Three axioms for natural images 1. Scale invariance 2. For all zero-mean filters F, the scalar random variables F I (i, j ) have kurtosis > 3 (D.Field, J.Huang). 3. Local image patches are dominated by ‘preferred geometries’: edges, bars, blobs as well as ‘blue sky’ blank patches (D.Marr, B.Julesz, A.Lee). It is not known if these axioms can be exactly satisfied! Empirical data on image filter responses Probability distributions of 1 and 2 filters, estimated from natural image data. a) Top plot is for values of horizontal first difference of pixel values; middle plot is for random 0-mean 8x8 filters. Vertical axis in top 2 plots is log(prob.density). b) Bottom plot shows level curves of Joint prob.density of vert.differences at two horizontally adjacent pixels. All are highly non-Gaussian! Mathematical models for random images I ( x, y ) k (erk x xk , e rk y yk ), k ( xk , yk , rk ) a Poisson process in k independent samples from 3 , prob.space of normalized wavelets. (Random wavelet model) I ( x, y ) k ( x , y ) ( e rk ( x , y ) x xk ( x , y ) , e rk ( x , y ) y yk ( x , y ) ), k ( x, y ) min k (erk x xk , e rk y yk ) supp( k ) (“Dead leaves” model) Continuous models III: random diffeomorphisms • The patterns of the world include shapes, structures which recur with distortions: e.g. alphanumeric characters, faces, anatomy • Thus the hidden variables must include (i) clusters of similar shapes, (ii) warpings between shapes in a cluster • Mathematically: need a metric on (i) the space of diffeomorphisms Gk of k, or (ii) the space of “shapes” Sk in k (open subsets with smooth bdry) Can use diffusion to define a probability measure on Gk . Metrics on Gk, I V.Arnold: Introduce a Riemannian metric in SGk, the group of volume preserving diffeos. For any path {t}, let length of path k t 1 vt ( x ) dx dt , vt ( x) t ( x) t 2 Then he proved geodesics are solutions of Euler’s equation: vt (vt .)vt p t Metrics on Gk, II Christensen, Rabbitt, Miller – on Gk, use stronger metric: k Lvt , vt dx, L ( I D)m vt = velocity, ut = Lvt = momentum in this metric Geodesics now are solutions to a regularized compressible form of Euler’s equation: ut (vt .)ut div(vt )ut (ut )i ((vt )i ) t i Note: linear in u, so u can be a generalized function! Geodesics in the quotient space S2 Get geodesics {Ht} on Sk by taking singular momentum: ut ( x ) a(t , x )nx , Ht Ht ( x ) S2 has remarkable structure: (A geodesic, F.Beg) 1. ‘Weak’ Hilbert manifold 2. ‘Medial axis’ gives it a cell decomposition 3. Geometric heat equation defines a deformation retraction 4. Diffusion defines probability measure (Dupuis-Grenander-Miller, Yip) Geodesics in the quotient space of ‘landmark points’ gives a classical mechanical system (Younes) Consider geodesics whose momentum has finite support: N ut ( x ) ui (t ) Pi (t ) ( x ) i 1 This gives the ODE in which particles traveling in the same (resp. opposite) direction attract (resp. repel): dPi 2 K Pi Pj u j dt j dui Pi K Pi Pj .(ui .u j ) dt j K the Green’s function of L, a Bessel function. Outlook for Pattern Theory Finding a rich class of stochastic models adequate for duplicating human perception yet tractable (vision remains a major challenge) Finding algorithms fast enough to make inferences with these models (Monte Carlo? BBP ? competing hypothesis particles?) Underpinnings for a better biological theory of neural functioning e.g. incorporating particle filtering? grammar? warping? feedback? URL: www.dam.brown.edu/people/mumford/Papers /ICM02powerpoint.pdf or /ICM02proceedings.pdf A sample of Graunt’s data