Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping
Akio Utsugi
National Institute of Bioscience and Human-Technology
Neural Processing Letters, vol. 12, no. 3, pp. 277-290.
Summarized by Jong-Youn Lim
Introduction

SOM (self-organizing map)
- A minimal model for the formation of topology-preserving maps
- An information-processing tool for extracting a hidden smooth manifold from data
- Drawback: no explicit statistical model of the data generation

Alternatives
- Elastic net
- Generative topographic mapping (GTM): based on a mixture of spherical Gaussian generators with a constraint on the centroids

Hyperparameter search of GTM on small data
using a Gibbs sampler, but time consuming on
large data
 Needs for deterministic algorithm producing the
estimates quickly – ensemble learning(to minimize
the variational free energy of the model, which
gives an upper bound of negative log evidence)
Generative topographic mapping

- Two versions: the original regression version and a Gaussian process version
- The model consists of a spherical Gaussian mixture density and a Gaussian process prior
- The spherical Gaussian mixture density:
f(X|W,\beta) = \int f(X|Y,W,\beta)\, f(Y)\, dY

f(X|Y,W,\beta) = \prod_{i=1}^{n} \prod_{k=1}^{r} \left[ (\beta/2\pi)^{m/2} \exp\left(-\tfrac{\beta}{2}\|x_i - w_k\|^2\right) \right]^{y_{ik}}    (1)

f(Y) = \prod_{i=1}^{n} \prod_{k=1}^{r} (1/r)^{y_{ik}}    (2)

f(X|W,\beta) = r^{-n} (\beta/2\pi)^{nm/2} \prod_{i=1}^{n} \sum_{k=1}^{r} \exp\left(-\tfrac{\beta}{2}\|x_i - w_k\|^2\right)    (3)
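As a concrete reading of (3), here is a minimal NumPy sketch (not taken from the paper; the function name and array shapes are illustrative) that evaluates the log marginal likelihood for data X of shape (n, m) and centroids W of shape (r, m):

import numpy as np

def gtm_log_likelihood(X, W, beta):
    # log f(X | W, beta) from Eq. (3): an equal-weight mixture of r spherical Gaussians
    n, m = X.shape
    r = W.shape[0]
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)       # ||x_i - w_k||^2, shape (n, r)
    a = -0.5 * beta * d2
    amax = a.max(axis=1, keepdims=True)
    log_inner = amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1))  # stable log sum_k exp(...)
    return -n * np.log(r) + 0.5 * n * m * np.log(beta / (2 * np.pi)) + log_inner.sum()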
- W has a Gaussian prior:
f(W|h) = (2\pi)^{-rm/2} |M|^{m/2} \prod_{j=1}^{m} \exp\left(-\tfrac{1}{2} w_{(j)}' M w_{(j)}\right)    (4)

where w_{(j)} = (w_{1j}, \ldots, w_{rj})' collects the j-th coordinate of all r centroids and M is the prior precision matrix determined by the hyperparameters h
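For completeness, a small sketch (again illustrative, not the paper's code) of the log of the prior (4), with W stored as an (r, m) array whose columns are the vectors w_(j) and M an (r, r) precision matrix:

import numpy as np

def log_prior_W(W, M):
    # log f(W | h) from Eq. (4): independent N(0, M^{-1}) priors on each coordinate column w_(j)
    r, m = W.shape
    sign, logdet = np.linalg.slogdet(M)                  # log |M| (M assumed positive definite)
    quad = sum(W[:, j] @ M @ W[:, j] for j in range(m))  # sum_j w_(j)' M w_(j)
    return -0.5 * r * m * np.log(2 * np.pi) + 0.5 * m * logdet - 0.5 * quad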
- Bayesian inference of W:

f(W|X,h) \propto f(X,W|h) = f(X|W,\beta)\, f(W|h)    (5)

- Inference of h is based on its evidence f(X|h); the maximizer of the evidence is called the generalized maximum likelihood (GML) estimate of h
- The approximations used in the hyperparameter search algorithm are valid only when data are abundant
- The hyperparameter search is therefore improved by using a Gibbs sampler
Gibbs sampler in GTM

- Any moment of the posteriors can be obtained to arbitrary precision by averaging over a long series of samples
- The Gibbs sampler is an MCMC method that does not require the design of a trial distribution
- Conditional posteriors on Y and W
- Conditional posterior on Y (p_{ik} are the posterior selection probabilities of the inner units):
f(Y|X,W,\beta) = \prod_{i=1}^{n} \prod_{k=1}^{r} p_{ik}^{y_{ik}}    (6)

p_{ik} = \exp\left(-\tfrac{\beta}{2}\|x_i - w_k\|^2\right) \Big/ \sum_{k'=1}^{r} \exp\left(-\tfrac{\beta}{2}\|x_i - w_{k'}\|^2\right)    (7)
- The conditional posterior on W is obtained by normalizing f(X,Y,W|h), the product of (1), (2), and (4):

f(W|X,Y,h) = \prod_{j=1}^{m} N(w_{(j)} \mid \mu_{(j)}, \Sigma)    (8)

\Sigma = (\beta N + M)^{-1}    (9)

\mu_{(j)} = \beta\, \Sigma\, s_{(j)}    (10)

N = \mathrm{diag}(n_1, \ldots, n_r) = \sum_{i=1}^{n} \mathrm{diag}(y_{i1}, \ldots, y_{ir})    (11)

s_{(j)} = (s_{1j}, \ldots, s_{rj})' = \sum_{i=1}^{n} x_{ij}\, (y_{i1}, \ldots, y_{ir})'    (12)
- Conditional posteriors on the hyperparameters

The prior precision matrix is M = \alpha D'D + \varepsilon E'E, so the prior on W becomes

f(W|\alpha,\varepsilon) = (2\pi)^{-rm/2}\, \alpha^{lm/2}\, \varepsilon^{(r-l)m/2}\, |D'D + E'E|^{m/2} \exp\left(-\tfrac{1}{2}\sum_{j=1}^{m}\left(\alpha\|D w_{(j)}\|^2 + \varepsilon\|E w_{(j)}\|^2\right)\right)    (16)

- Each hyperparameter is given a Gamma prior, e.g.

f(\alpha \mid d_\alpha, s_\alpha) = G(\alpha \mid d_\alpha, s_\alpha)    (17)

where G(x \mid d, s) = s^d x^{d-1} \exp(-sx) / \Gamma(d)

- The conditional posteriors on the hyperparameters are obtained by normalizing the joint density and are again Gamma distributions with updated shape and scale parameters, e.g.

f(\alpha \mid X, Y, W, H) = G(\alpha \mid \tilde{d}_\alpha, \tilde{s}_\alpha)
Ensemble learning in GTM

- Ensemble learning is a deterministic algorithm that obtains estimates of the parameters and hyperparameters concurrently
- Approximating ensemble density Q and its variational free energy under a model H:

F(Q|H) = \int Q(Y,W,\alpha,\beta) \log\frac{Q(Y,W,\alpha,\beta)}{f(X,Y,W,\alpha,\beta|H)}\, dY\, dW\, d\alpha\, d\beta    (27)

- If Q is restricted to a factorial form, a straightforward algorithm for the minimization of F is obtained
- The optimization procedure (a reduced sketch of the resulting loop follows this list):
1. Initial densities are set for the partial ensembles Q(Y), Q(W), Q(\alpha), Q(\beta)
2. A new density Q(Y) is obtained from the other densities by

Q(Y) \propto \exp\left\{\int Q(W)\, Q(\alpha)\, Q(\beta) \log f(X,Y,W,\alpha,\beta|H)\, dW\, d\alpha\, d\beta\right\}    (28)

3. Each of the other partial ensembles is updated with the same formula, with Y exchanged for the target variable
4. These updates of the partial ensembles are repeated until a convergence condition is satisfied
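A reduced sketch of steps 1-4, holding beta and the prior precision M fixed so that only Q(Y) (soft responsibilities R) and Q(W) are updated; the paper's full algorithm also updates Q(alpha) and Q(beta), and all names here are illustrative.

import numpy as np

def ensemble_learning(X, W0, beta, M, n_iter=100, tol=1e-6):
    # Alternate mean-field updates of Q(Y) and Q(W) = prod_j N(w_(j) | mu_(j), Sigma)
    n, m = X.shape
    r = W0.shape[0]
    mu, Sigma = W0.copy(), np.zeros((r, r))
    prev = None
    for _ in range(n_iter):
        # Q(Y) step: expected squared distance under Q(W) adds the m * Sigma_kk variance term
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) + m * np.diag(Sigma)
        a = -0.5 * beta * d2
        a -= a.max(axis=1, keepdims=True)
        R = np.exp(a)
        R /= R.sum(axis=1, keepdims=True)          # soft responsibilities replacing the hard y_ik
        # Q(W) step: the Gaussian of Eqs. (8)-(12) with R in place of Y
        Sigma = np.linalg.inv(beta * np.diag(R.sum(axis=0)) + M)
        mu = beta * Sigma @ (R.T @ X)
        # crude convergence check on the centroid means
        if prev is not None and np.abs(mu - prev).max() < tol:
            break
        prev = mu.copy()
    return mu, Sigma, R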
Simulations

- The algorithms are compared in simulations: ensemble learning (the deterministic algorithm) and the Gibbs sampler
- Artificial data x_i = (x_{i1}, x_{i2})', i = 1, ..., n, are generated from two independent standard Gaussian random series {e_{i1}}, {e_{i2}} by (a data-generation sketch follows this list)

x_{i1} = 4(i-1)/n - 2 + \sigma e_{i1}    (42)

x_{i2} = \sin[2\pi(i-1)/n] + \sigma e_{i2}    (43)

- Three noise levels: \sigma = 0.3, 0.4, 0.5
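A sketch of the data generation in (42)-(43); the values of n and the seed in the usage comment are arbitrary illustration values, not taken from the paper:

import numpy as np

def generate_data(n, sigma, rng):
    # Eqs. (42)-(43): a straight line and a sine wave, each corrupted by Gaussian noise
    i = np.arange(1, n + 1)
    x1 = 4 * (i - 1) / n - 2 + sigma * rng.standard_normal(n)              # Eq. (42)
    x2 = np.sin(2 * np.pi * (i - 1) / n) + sigma * rng.standard_normal(n)  # Eq. (43)
    return np.column_stack([x1, x2])

# e.g. X = generate_data(200, 0.3, np.random.default_rng(0))  # one of the three noise levels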
[Figure slides: simulation results; the figures are not preserved in the transcript]
Conclusion

- A simulation experiment showed the superiority of the Gibbs sampler on small data sets and the validity of the deterministic algorithms on large data sets