Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping
Akio Utsugi
National Institute of Bioscience and Human-Technology
Neural Processing Letters, vol. 12, no. 3, pp. 277-290.
Summarized by Jong-Youn Lim
Introduction

SOM (self-organizing map)
- A minimal model for the formation of topology-preserving maps
- An information-processing tool for extracting a hidden smooth manifold from data
- Drawback: no explicit statistical model of the data generation

Alternatives
- Elastic net
- Generative topographic mapping (GTM): based on a mixture of spherical Gaussian generators with a constraint on the centroids

Hyperparameter search of GTM on small data
using a Gibbs sampler, but time consuming on
large data
 Needs for deterministic algorithm producing the
estimates quickly – ensemble learning(to minimize
the variational free energy of the model, which
gives an upper bound of negative log evidence)
Generative topographic mapping

- Two versions: the original regression version and a Gaussian process version
- The model consists of a spherical Gaussian mixture density and a Gaussian process prior
- The spherical Gaussian mixture density:
f(X|W,\beta) = \int f(X|Y,W,\beta)\, f(Y)\, dY

f(X|Y,W,\beta) = \prod_{i=1}^{n} \prod_{k=1}^{r} \left[ (\beta/2\pi)^{m/2} \exp\left(-\tfrac{\beta}{2}\|x_i - w_k\|^2\right) \right]^{y_{ik}}    (1)

f(Y) = \prod_{i=1}^{n} \prod_{k=1}^{r} (1/r)^{y_{ik}}    (2)

f(X|W,\beta) = r^{-n} (\beta/2\pi)^{nm/2} \prod_{i=1}^{n} \sum_{k=1}^{r} \exp\left(-\tfrac{\beta}{2}\|x_i - w_k\|^2\right)    (3)
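As a concrete reading of (3), here is a minimal NumPy sketch (not taken from the paper; the function name and array shapes are illustrative) that evaluates the log marginal likelihood for data X of shape (n, m) and centroids W of shape (r, m):

import numpy as np

def gtm_log_likelihood(X, W, beta):
    # log f(X | W, beta) from Eq. (3): an equal-weight mixture of r spherical Gaussians
    n, m = X.shape
    r = W.shape[0]
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)       # ||x_i - w_k||^2, shape (n, r)
    a = -0.5 * beta * d2
    amax = a.max(axis=1, keepdims=True)
    log_inner = amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1))  # stable log sum_k exp(...)
    return -n * np.log(r) + 0.5 * n * m * np.log(beta / (2 * np.pi)) + log_inner.sum()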
- W has a Gaussian prior:
f(W|h) = (2\pi)^{-rm/2} |M|^{m/2} \prod_{j=1}^{m} \exp\left(-\tfrac{1}{2} w_{(j)}' M w_{(j)}\right)    (4)

where w_{(j)} = (w_{1j}, \ldots, w_{rj})' collects the j-th coordinate of all r centroids and M is the prior precision matrix determined by the hyperparameters h
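For completeness, a small sketch (again illustrative, not the paper's code) of the log of the prior (4), with W stored as an (r, m) array whose columns are the vectors w_(j) and M an (r, r) precision matrix:

import numpy as np

def log_prior_W(W, M):
    # log f(W | h) from Eq. (4): independent N(0, M^{-1}) priors on each coordinate column w_(j)
    r, m = W.shape
    sign, logdet = np.linalg.slogdet(M)                  # log |M| (M assumed positive definite)
    quad = sum(W[:, j] @ M @ W[:, j] for j in range(m))  # sum_j w_(j)' M w_(j)
    return -0.5 * r * m * np.log(2 * np.pi) + 0.5 * m * logdet - 0.5 * quad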
- Bayesian inference of W:

f(W|X,h) \propto f(X,W|h) = f(X|W,\beta)\, f(W|h)    (5)

- Inference of h is based on its evidence f(X|h); the maximizer of the evidence is called the generalized maximum likelihood (GML) estimate of h
- The approximations used in the hyperparameter search algorithm are valid only when data are abundant
- The hyperparameter search is therefore improved by using a Gibbs sampler
Gibbs sampler in GTM

- Any moment of the posteriors can be obtained to arbitrary precision by averaging over a long series of samples
- The Gibbs sampler is an MCMC method that does not require the design of a trial distribution
- Conditional posteriors on Y and W
- Conditional posterior on Y (p_{ik} are the posterior selection probabilities of the inner units):
f(Y|X,W,\beta) = \prod_{i=1}^{n} \prod_{k=1}^{r} p_{ik}^{y_{ik}}    (6)

p_{ik} = \exp\left(-\tfrac{\beta}{2}\|x_i - w_k\|^2\right) \Big/ \sum_{k'=1}^{r} \exp\left(-\tfrac{\beta}{2}\|x_i - w_{k'}\|^2\right)    (7)
- The conditional posterior on W is obtained by normalizing f(X,Y,W|h), the product of (1), (2), and (4):

f(W|X,Y,h) = \prod_{j=1}^{m} N(w_{(j)} \mid \mu_{(j)}, \Sigma)    (8)

\Sigma = (\beta N + M)^{-1}    (9)

\mu_{(j)} = \beta\, \Sigma\, s_{(j)}    (10)

N = \mathrm{diag}(n_1, \ldots, n_r) = \sum_{i=1}^{n} \mathrm{diag}(y_{i1}, \ldots, y_{ir})    (11)

s_{(j)} = (s_{1j}, \ldots, s_{rj})' = \sum_{i=1}^{n} x_{ij}\, (y_{i1}, \ldots, y_{ir})'    (12)
- Conditional posteriors on the hyperparameters

The prior precision matrix is M = \alpha D'D + \varepsilon E'E, so the prior on W becomes

f(W|\alpha,\varepsilon) = (2\pi)^{-rm/2}\, \alpha^{lm/2}\, \varepsilon^{(r-l)m/2}\, |D'D + E'E|^{m/2} \exp\left(-\tfrac{1}{2}\sum_{j=1}^{m}\left(\alpha\|D w_{(j)}\|^2 + \varepsilon\|E w_{(j)}\|^2\right)\right)    (16)

- Each hyperparameter is given a Gamma prior, e.g.

f(\alpha \mid d_\alpha, s_\alpha) = G(\alpha \mid d_\alpha, s_\alpha)    (17)

where G(x \mid d, s) = s^d x^{d-1} \exp(-sx) / \Gamma(d)

- The conditional posteriors on the hyperparameters are obtained by normalizing the joint density and are again Gamma distributions with updated shape and scale parameters, e.g.

f(\alpha \mid X, Y, W, H) = G(\alpha \mid \tilde{d}_\alpha, \tilde{s}_\alpha)
Ensemble learning in GTM

- Ensemble learning is a deterministic algorithm that obtains estimates of the parameters and hyperparameters concurrently
- Approximating ensemble density Q and its variational free energy under a model H:

F(Q|H) = \int Q(Y,W,\alpha,\beta) \log\frac{Q(Y,W,\alpha,\beta)}{f(X,Y,W,\alpha,\beta|H)}\, dY\, dW\, d\alpha\, d\beta    (27)

- If Q is restricted to a factorial form, a straightforward algorithm for the minimization of F is obtained
- The optimization procedure (a reduced sketch of the resulting loop follows this list):
1. Initial densities are set for the partial ensembles Q(Y), Q(W), Q(\alpha), Q(\beta)
2. A new density Q(Y) is obtained from the other densities by

Q(Y) \propto \exp\left\{\int Q(W)\, Q(\alpha)\, Q(\beta) \log f(X,Y,W,\alpha,\beta|H)\, dW\, d\alpha\, d\beta\right\}    (28)

3. Each of the other partial ensembles is updated with the same formula, with Y exchanged for the target variable
4. These updates of the partial ensembles are repeated until a convergence condition is satisfied
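A reduced sketch of steps 1-4, holding beta and the prior precision M fixed so that only Q(Y) (soft responsibilities R) and Q(W) are updated; the paper's full algorithm also updates Q(alpha) and Q(beta), and all names here are illustrative.

import numpy as np

def ensemble_learning(X, W0, beta, M, n_iter=100, tol=1e-6):
    # Alternate mean-field updates of Q(Y) and Q(W) = prod_j N(w_(j) | mu_(j), Sigma)
    n, m = X.shape
    r = W0.shape[0]
    mu, Sigma = W0.copy(), np.zeros((r, r))
    prev = None
    for _ in range(n_iter):
        # Q(Y) step: expected squared distance under Q(W) adds the m * Sigma_kk variance term
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) + m * np.diag(Sigma)
        a = -0.5 * beta * d2
        a -= a.max(axis=1, keepdims=True)
        R = np.exp(a)
        R /= R.sum(axis=1, keepdims=True)          # soft responsibilities replacing the hard y_ik
        # Q(W) step: the Gaussian of Eqs. (8)-(12) with R in place of Y
        Sigma = np.linalg.inv(beta * np.diag(R.sum(axis=0)) + M)
        mu = beta * Sigma @ (R.T @ X)
        # crude convergence check on the centroid means
        if prev is not None and np.abs(mu - prev).max() < tol:
            break
        prev = mu.copy()
    return mu, Sigma, R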
Simulations

- The algorithms are compared in simulations: ensemble learning (the deterministic algorithm) and the Gibbs sampler
- Artificial data x_i = (x_{i1}, x_{i2})', i = 1, ..., n, are generated from two independent standard Gaussian random series {e_{i1}}, {e_{i2}} by (a data-generation sketch follows this list)

x_{i1} = 4(i-1)/n - 2 + \sigma e_{i1}    (42)

x_{i2} = \sin[2\pi(i-1)/n] + \sigma e_{i2}    (43)

- Three noise levels: \sigma = 0.3, 0.4, 0.5
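A sketch of the data generation in (42)-(43); the values of n and the seed in the usage comment are arbitrary illustration values, not taken from the paper:

import numpy as np

def generate_data(n, sigma, rng):
    # Eqs. (42)-(43): a straight line and a sine wave, each corrupted by Gaussian noise
    i = np.arange(1, n + 1)
    x1 = 4 * (i - 1) / n - 2 + sigma * rng.standard_normal(n)              # Eq. (42)
    x2 = np.sin(2 * np.pi * (i - 1) / n) + sigma * rng.standard_normal(n)  # Eq. (43)
    return np.column_stack([x1, x2])

# e.g. X = generate_data(200, 0.3, np.random.default_rng(0))  # one of the three noise levels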
[Figure slides: simulation results; the figures are not preserved in the transcript]
Conclusion

- A simulation experiment showed the superiority of the Gibbs sampler on small data sets and the validity of the deterministic algorithms on large data sets