Natural Discriminant Analysis Using Interactive Potts Models
Neural Computation, April 2002
Jiann-Ming Wu
[email protected]
Department of Applied Mathematics, National Dong Hwa University
Outline
1. Discriminant analysis by learning samples
2. A generative model
3. PottsDA
- The mapping function of PottsDA
- The learning network of PottsDA
- A mixed integer and linear programming
- A hybrid of mean field annealing and gradient descent methods
- Interactive Dynamics
4. Incremental learning
5. Numerical simulations and discussion
6. Conclusions
Discriminant analysis by learning samples
[Figure: scatter plots of the training set (d = 2, N = 800, S = {(0 1), (1 0)}) and the testing set (d = 2, N = 800, S = {(0 1), (1 0)})]
[Diagram: the training set is fed to PottsDA to learn the mapping q = g(x; W); the testing set is then used to measure the correct rate.]
A generative model for parameters
1
p( x | y1 , A1 )
2
p( x | y2 , A2 )
p ( x | y k , Ak )
…
…
K
…
…
k
Flip
Coin
x
p( x | y K , AK )
A generative model
Piecewise multivariate Gaussian distribution
Voronoi partition
Ω_k = {x ∈ R^d | arg min_j ‖x − y_j‖_A = k}, where ‖x‖_A denotes the Mahalanobis distance of x induced by A.
[Figure: partition by the Mahalanobis distance vs. partition by A = I]
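As an illustration of this partition, the following sketch assigns each predictor to the region of the nearest kernel under the Mahalanobis distance ‖x − y_k‖_A; the function name and array shapes are illustrative, not from the paper.

import numpy as np

def mahalanobis_partition(X, Y, A):
    """Assign each row of X (N x d) to the nearest kernel in Y (K x d)
    under the Mahalanobis distance induced by the d x d matrix A."""
    diffs = X[:, None, :] - Y[None, :, :]                 # (N, K, d)
    d2 = np.einsum('nkd,de,nke->nk', diffs, A, diffs)     # squared Mahalanobis distances
    return np.argmin(d2, axis=1)                          # region index k for each sample

With A = I this reduces to the ordinary Euclidean Voronoi partition.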
The mapping function of PottsDA
Category of each region:
ξ_k ∈ {e_1^M, …, e_M^M} (e.g., {(0 1), (1 0)} when M = 2), 1 ≤ k ≤ K
The mapping function of PottsDA
• For an extreme beta value,
where z_k = B y_k and A = B′B.
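The transcript omits the equations on this slide; the sketch below shows one standard soft-to-hard construction consistent with the generative model above, in which the category responses are mixed by softmin weights over the Mahalanobis distances and the output collapses to the nearest kernel's category as beta grows. All names here are illustrative, not the paper's notation.

import numpy as np

def potts_map(x, Y, Xi, A, beta):
    """Soft class assignment q = g(x; W) for a single predictor x.
    Y : (K, d) kernels, Xi : (K, M) category responses,
    A : (d, d) Mahalanobis matrix, beta : inverse temperature."""
    diff = Y - x                                      # (K, d)
    d2 = np.einsum('kd,de,ke->k', diff, A, diff)      # squared Mahalanobis distances
    w = np.exp(-beta * (d2 - d2.min()))               # numerically stable softmin weights
    w /= w.sum()
    return w @ Xi                                     # (M,) class probabilities

For an extreme beta value the weights w approach a one-hot vector, so q equals the category response ξ_k of the nearest region.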
The learning network of PottsDA
[Diagram: interconnected data networks (A, {y}, ⟨δ⟩, ⟨ξ⟩) coupling a Potts model for ⟨δ⟩, a Potts model for ⟨ξ⟩, dynamics for A, and dynamics for y.]
• Membership vectors: δ_i ∈ {e_1^K, ..., e_K^K}, 1 ≤ i ≤ N
• Measurement A
• Kernels {y_k, 1 ≤ k ≤ K}
• Category of each region Ω_k: ξ_k ∈ {e_1^M, ..., e_M^M}, 1 ≤ k ≤ K
Objective: fitness of the generative model to all predictors
• Maximum likelihood, written in terms of the membership vectors
• Using ln det(A⁻¹) = −ln det(A) and neglecting the last constant term
• Maximizing the function ℓ is equivalent to minimizing the function E₁
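The likelihood and E₁ themselves are not captured in the transcript. One plausible reconstruction, assuming each region generates its predictors from a multivariate Gaussian with mean y_k and common precision matrix A, is:

\ell(\{y_k\}, A, \{\delta_i\})
  = \sum_{i=1}^{N}\sum_{k=1}^{K}\delta_{ik}
    \Bigl(\tfrac{1}{2}\ln\det A - \tfrac{1}{2}\lVert x_i - y_k\rVert_A^{2}\Bigr)
    - \tfrac{Nd}{2}\ln 2\pi,
\qquad \lVert x\rVert_A^{2} = x^{\top}A\,x .

With ln det(A⁻¹) = −ln det(A) and the constant dropped, maximizing ℓ is then equivalent to minimizing E₁ = Σ_i Σ_k δ_ik ‖x_i − y_k‖²_A − N ln det A.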
Objective: Minimal design cost
MILP: a mixed integer and linear programming problem
• Objectives and constraints
Minimizing the design cost (the number of prediction errors) subject to the one-hot constraints on the binary membership and category vectors
A hybrid of mean field annealing and gradient descent methods
• The gradient descent method cannot be directly applied to binary variables
• Mean field equations (MFE) for the binary variables and gradient descent (GD) for the continuous variables
• {δ_i} and {ξ_k} are associated with Potts neural variables, or Potts spins in statistical mechanics
An energy function for the MILP
Mean field annealing
• Fixing A and {y_k}, trace the mean configurations ⟨δ⟩ and ⟨ξ⟩ to emulate thermal equilibrium at each temperature.
• The probability of a system configuration is proportional to the Boltzmann distribution.
• The annealing process gradually increases the parameter β from a sufficiently low value to a large one.
• For a sufficiently large β value, the Boltzmann distribution is ultimately dominated by the optimal configuration.
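At a fixed temperature the mean configuration of a Potts spin is a Boltzmann-weighted average, which in the simplest case is a softmax of its mean fields. The sketch below is a generic stand-in for that update (it is not the paper's equations (20)-(23)); it only illustrates how increasing β sharpens the averages.

import numpy as np

def potts_mean_config(V, beta):
    """Mean Potts configuration given mean fields V (N x K):
    <delta_ik> = exp(beta * V_ik) / sum_j exp(beta * V_ij)."""
    Z = np.exp(beta * (V - V.max(axis=1, keepdims=True)))   # stabilized exponentials
    return Z / Z.sum(axis=1, keepdims=True)

At a sufficiently low β every average stays near 1/K; at a sufficiently large β each row saturates toward a one-hot configuration, which is why the annealing schedule ends at the dominant (optimal) configuration.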
A free energy for interactive Potts models
• The free energy results from minimizing the energy and maximizing the entropy,
where ⟨δ⟩, ⟨ξ⟩, u, and v denote {⟨δ_i⟩}, {⟨ξ_k⟩}, {u_km}, and {v_ik}, respectively, and u_k and v_i are auxiliary vectors.
Derivation of interactive dynamics
Mean field equations
Updating rule for each element of A
• The update of A follows from setting ∂⟨E⟩/∂A_mn = 0 for each element A_mn.
Adaptation of the kernels {y_k}
• Gradient descent method
• Again, the fixed point follows from setting ∂⟨E⟩/∂y_k = 0.
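Equation (28) is not reproduced in the transcript. Under the Mahalanobis fitting cost sketched earlier, setting the gradient with respect to y_k to zero yields a membership-weighted mean, since ∂/∂y_k Σ_i ⟨δ_ik⟩‖x_i − y_k‖²_A = −2A Σ_i ⟨δ_ik⟩(x_i − y_k) = 0. A minimal sketch of this fixed point (names illustrative):

import numpy as np

def update_kernels(X, delta):
    """Fixed-point kernel update: each y_k becomes the <delta_ik>-weighted
    mean of the predictors. X : (N, d), delta : (N, K); returns (K, d)."""
    w = delta / (delta.sum(axis=0, keepdims=True) + 1e-12)   # normalize per kernel
    return w.T @ X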
The learning process of PottsDA
1. Set a sufficiently low β value, each kernel y_k near the mean of all predictors, each ⟨δ_ik⟩ near 1/K, and each ⟨ξ_km⟩ near 1/M.
2. Iteratively update all ⟨δ_ik⟩ and v_ik by equation (20) and equation (21), respectively, to a stationary point.
3. Iteratively update each ⟨ξ_km⟩ and u_km by equation (22) and equation (23), respectively, to a stationary point.
4. Update each y_k by equation (28).
5. Update A by equations (25) and (26).
6. If the saturation levels of ⟨δ⟩ and ⟨ξ⟩ (e.g., Σ_k⟨δ_ik⟩²) are larger than a prior threshold, then halt; otherwise increase β by an annealing schedule and go to step 2.
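The following skeleton puts steps 1-6 together as runnable code. It is a minimal sketch only: the mean-field expressions, the A update, and the coupling constant lam are generic stand-ins for the paper's equations (20)-(28), which are not shown in this transcript, and the saturation test uses Σ_k⟨δ_ik⟩² as suggested by the later convergence plot.

import numpy as np

def softmax_rows(V, beta):
    Z = np.exp(beta * (V - V.max(axis=1, keepdims=True)))
    return Z / Z.sum(axis=1, keepdims=True)

def pottsda_anneal(X, T, K, beta=0.05, rate=1.05, sat=0.98, lam=1.0, max_steps=500):
    """Annealed learning loop (steps 1-6). X : (N, d) predictors,
    T : (N, M) one-hot targets, K : model size."""
    rng = np.random.default_rng(0)
    N, d = X.shape
    M = T.shape[1]
    Y = np.tile(X.mean(axis=0), (K, 1)) + 0.01 * rng.standard_normal((K, d))  # step 1
    A = np.eye(d)
    delta = np.full((N, K), 1.0 / K)
    xi = np.full((K, M), 1.0 / M)
    for _ in range(max_steps):
        diff = X[:, None, :] - Y[None, :, :]
        d2 = np.einsum('nkd,de,nke->nk', diff, A, diff)       # Mahalanobis fit cost
        delta = softmax_rows(-d2 + lam * (T @ xi.T), beta)    # step 2: update <delta>
        xi = softmax_rows(lam * (delta.T @ T), beta)          # step 3: update <xi>
        w = delta / (delta.sum(axis=0, keepdims=True) + 1e-12)
        Y = w.T @ X                                           # step 4: update kernels
        diff = X[:, None, :] - Y[None, :, :]
        cov = np.einsum('nk,nkd,nke->de', delta, diff, diff) / N
        A = np.linalg.inv(cov + 1e-6 * np.eye(d))             # step 5: update A
        if min(np.square(delta).sum(axis=1).mean(),
               np.square(xi).sum(axis=1).mean()) > sat:       # step 6: saturation test
            break
        beta *= rate                                          # annealing schedule
    return Y, A, delta, xi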
Incremental Learning for Optimal Model Size
• The variables {⟨δ_ik⟩} and {⟨ξ_km⟩} are first recorded in temporary variables {δ_ik*} and {ξ_km*} to establish an intermediate recognizing network with the kernels {y_k} and the matrix A.
• Whether the size of {y_k} is sufficient depends on the design cost, which counts the errors in predicting all training samples.
• Define the hit ratio r as the proportion of training samples predicted correctly.
• Define the local hit ratio r_k analogously for each internal region Ω_k = {x | k = arg min_j ‖x − y_j‖_A}, where N_k denotes the size of the set {x_i ∈ Ω_k}.
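A sketch of these two quantities under the reading above (a hit is a correctly predicted training sample; the names are illustrative):

import numpy as np

def hit_ratios(X, labels, Y, Xi, A):
    """Global hit ratio r and local hit ratios r_k.
    labels : (N,) true class indices, Xi : (K, M) category responses."""
    diff = X[:, None, :] - Y[None, :, :]
    d2 = np.einsum('nkd,de,nke->nk', diff, A, diff)
    region = d2.argmin(axis=1)                     # k = arg min_j ||x - y_j||_A
    pred = Xi.argmax(axis=1)[region]               # predicted class per sample
    hits = (pred == labels)
    r = hits.mean()                                # hit ratio over all N samples
    r_k = np.array([hits[region == k].mean() if np.any(region == k) else 0.0
                    for k in range(Y.shape[0])])   # hit ratio over the N_k samples in each region
    return r, r_k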
Incremental Learning
• The kernel y_k of an under-estimated internal region, i.e., one with r_k < θ, is duplicated with a small perturbation to produce its twin kernel, denoted by y_k*.
• A set of new kernels {y_k_new} is created as the union of {y_k} and {y_k* | r_k < θ}, giving a model size of K + K*, where K and K* denote the number of original kernels and the number of under-estimated internal regions, respectively.
• The index of y_k remains k, and the index of y_k* is denoted by k′.
Incremental Learning
• For each input predictor x_i, assuming x_i ∈ Ω(y_k), a new membership vector δ_i_new, now with K + K* elements, is created: if region k was not split, δ_i_new simply extends δ_i* with zeros; if region k was split, x_i is given the same probability 1/2 of being mapped to each of the twins, y_k_new and y_k′_new.
• Create new category responses {ξ_k_new} for the new kernels {y_k_new} and set each of their elements near 1/M.
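A sketch of this duplication and re-initialization step (the threshold θ and the perturbation scale eps are illustrative):

import numpy as np

def grow_model(Y, delta, xi, region, r_k, theta, M, eps=1e-3, seed=0):
    """Duplicate the kernels of under-estimated regions (r_k < theta);
    memberships in a split region are shared 1/2-1/2 between the twins,
    and the new category responses start near 1/M."""
    rng = np.random.default_rng(seed)
    bad = np.where(r_k < theta)[0]                 # under-estimated regions
    K, K_star = Y.shape[0], len(bad)
    twins = Y[bad] + eps * rng.standard_normal((K_star, Y.shape[1]))
    Y_new = np.vstack([Y, twins])                  # model size K + K*
    delta_new = np.hstack([delta, np.zeros((delta.shape[0], K_star))])
    for j, k in enumerate(bad):
        in_k = (region == k)                       # samples with x_i in Omega(y_k)
        delta_new[in_k, k] = 0.5
        delta_new[in_k, K + j] = 0.5
    xi_new = np.vstack([xi, np.full((K_star, M), 1.0 / M)])
    return Y_new, delta_new, xi_new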
Incremental Learning process
1. Set a sufficiently low β value, a threshold θ, and an initial model size K; set each kernel y_k near the mean of all predictors, each ⟨δ_ik⟩ near 1/K, and each ⟨ξ_km⟩ near 1/M.
2. Iteratively update all ⟨δ_ik⟩ and v_ik by equations (20) and (21), respectively, to a stationary point.
3. Iteratively update each ⟨ξ_km⟩ and u_km by equations (22) and (23), respectively, to a stationary point.
4. Update each y_k by equation (28).
5. Update A by equations (25) and (26).
Incremental Learning process (continued)
6. If the saturation levels of ⟨δ⟩ and ⟨ξ⟩ (e.g., Σ_k⟨δ_ik⟩²) are larger than a prior threshold, such as 0.98, then go to step 7; otherwise increase β by an annealing schedule and then go to step 2.
7. Record {⟨δ_i⟩} and {⟨ξ_k⟩} in temporary variables {δ_i*} and {ξ_k*}, respectively.
8. Determine r and all r_k using A, {y_k}, {δ_i*}, and {ξ_k*}.
9. If r > θ, halt; otherwise duplicate the kernels of the K* under-estimated regions with small perturbations and create the new variables {y_k_new}, {δ_i_new}, {ξ_k_new} from {y_k}, {y_k* | r_k < θ}, {δ_i*}, and {ξ_k*} as described above.
10. Set K ← K + K*, increase β by a small constant, replace {y_k}, {⟨δ_i⟩}, and {⟨ξ_k⟩} with {y_k_new}, {δ_i_new}, and {ξ_k_new}, respectively, and go to step 2.
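A usage sketch of this outer loop, built on the helper sketches above. For brevity it restarts the annealing at each growth step rather than continuing from the grown state as step 10 prescribes, and the data are a toy two-class problem; none of this reproduces the paper's experiments.

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-0.5, 0.5, size=(800, 2))
labels = ((np.sign(X[:, 0]) * np.sign(X[:, 1])) > 0).astype(int)
T = np.eye(2)[labels]                                         # one-hot targets
K, theta = 10, 0.98
beta = 0.05
for _ in range(20):                                           # cap the growth rounds
    Y, A, delta, xi = pottsda_anneal(X, T, K, beta=beta)      # steps 1-6
    r, r_k = hit_ratios(X, labels, Y, xi, A)                  # steps 7-8
    if r > theta:                                             # step 9: accept the model
        break
    diff = X[:, None, :] - Y[None, :, :]
    region = np.einsum('nkd,de,nke->nk', diff, A, diff).argmin(axis=1)
    Y, delta, xi = grow_model(Y, delta, xi, region, r_k, theta, M=2)
    K = Y.shape[0]                                            # step 10: K <- K + K*
    beta += 0.05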
Numerical Simulations
• Performance evaluation of
1. PottsDA
2. Radial basis function (RBF) method
3. Support vector machine (SVM) method (Vapnik, 1995)
Artificial data: Example 1
Mixing matrix
[Figure: results of PottsDA for Example 1 — the two columns of the inverse of the demixing matrix, the four kernels, and their category responses.]
Artificial data: Example 2
x(t) = H s(t), where s(t) = [s¹(t) s²(t) s³(t)]′
• s¹(t) and s²(t) are uniformly distributed within [-0.5, 0.5]
• s³(t) is Gaussian noise, N(0, √2)
• Discriminant rule: sign(s¹(t)) · sign(s²(t)); the third source acts as noise for prediction
[Figure: scatter plot of the Example 2 data.]
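The transcript does not give the mixing matrix H; the sketch below generates data of the same form with a hypothetical H, so it illustrates the setup rather than reproducing the paper's exact example. The slide's N(0, √2) is read here as a standard deviation of √2.

import numpy as np

rng = np.random.default_rng(0)
N = 800
H = np.array([[0.7, 0.3, 0.2],
              [0.4, 0.8, 0.1]])                      # hypothetical 2 x 3 mixing matrix
s = np.vstack([rng.uniform(-0.5, 0.5, N),            # s1(t), uniform on [-0.5, 0.5]
               rng.uniform(-0.5, 0.5, N),            # s2(t), uniform on [-0.5, 0.5]
               rng.normal(0.0, np.sqrt(2), N)])      # s3(t), Gaussian noise
x = H @ s                                            # observed predictors x(t) = H s(t)
labels = np.sign(s[0]) * np.sign(s[1])               # discriminant rule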
Artificial data: Example 3
[Figure: results of PottsDA for Example 3 — the two columns of the inverse of the demixing matrix, the 40 kernels, and their category responses.]
Incremental learning of PottsDA for Example 3
[Figure: Σ_k⟨δ_ik⟩² versus the time index for varying the beta value; the model size grows through K = 10, 20, 35, and 44 during incremental learning.]
Discriminant analysis of the Wisconsin Breast Cancer Database
• Wolberg and Mangasarian (1990)
• 699 instances, each containing 9 features, for predicting one of two categories: benign or malignant
• 458 instances in the benign category and 241 instances in the malignant category
Discriminant analysis of the Wisconsin Breast Cancer Database
[Diagram: a fine-needle aspiration (FNA) cell sample is imaged by microscopy, a feature extractor computes the features, and PottsDA produces the breast cancer diagnosis: benign or malignant.]
Features: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses
Simulation Results
• Wolberg and Mangasarian (1990): error rate for testing > 6%
• 683 instances of the database, as used by Malini Lamego (2001)
• For the 219-case test set, the RBF method with 80 kernels and the SVM method give testing error rates of 4.17% and 4.63%, respectively
Conclusions
• The PottsDA method is composed of a discriminant network and an annealed recurrent learning network.
• The generative model is a potential approach to characterizing the predictors of a training set.
• A hybrid of mean field annealing and gradient descent methods is reliable for solving the MILP.
• The incremental learning process is effective for determining the optimal model size.
• The performance of PottsDA is encouraging.