Neural Network-Based
Clustering
A. Selçuk MERCANLI
Supervisor: Assist. Prof.Dr. Turgay İBRİKÇİ
1
Why NN?
• NN have solved a wide range of problems
and have good learning capabilities.
Their strengths include adaptation, ease of
implementation, parallelization, speed, and
flexibility. NN-based clustering is closely
related to the concept of competitive
learning.
2
d: input dimension
w: weight, initially random
k: # of clusters
The similarity between input x and prototype wj is the inner product
s(x, wj) = Σi=1..d wji xi
3
Updating Weights
wj(t+1) = wj(t) + η(x(t) − wj(t))
η: learning rate. If η is zero there is no learning; if η is 1, learning is fast.
To avoid the problem of unlimited growth of the weights, the weight
vector must be normalized if the input pattern is normalized.
4
WTA - WTM
The competitive learning paradigm allows learning only for the particular
winning neuron that best matches the given input pattern. Thus,
it is also known as winner-take-all (WTA).
On the other hand, learning can also occur in a cooperative way,
which means that not just the winning neuron adjusts its prototype,
but all other cluster prototypes have the opportunity to be adapted
based on how close they are to the input pattern. This learning
scheme is called soft competitive learning or winner-take-most
(WTM).
- Hard competition
Only one neuron is activated
- Soft competition
Neurons neighboring the true winner are activated.
5
HARD COMPETITIVE LEARNING
CLUSTERING
• Online K-means Algorithm
• Leader Follower Clustering Algorithm
• Adaptive Resonance Theory
• Fuzzy ART
6
Online K-means Algorithm
1. Initialize K cluster prototype vectors m1, …, mK ∈ ℜd
randomly;
2. Present a normalized input pattern x ∈ ℜd;
3. Choose the winner J that has the smallest Euclidean
distance to x,
J = argmin_j ||x − mj||;
4. Update the winning prototype vector towards x,
mJ(new) = mJ(old) + η(x − mJ(old)),
where η is the learning rate;
5. Repeat steps 2-4 until the maximum number of steps
is reached.
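The five steps above can be sketched in a few lines (an illustration, not from the slides; NumPy, a fixed learning rate, and initialization from random input patterns are assumptions):

```python
import numpy as np

def online_kmeans(X, k, eta=0.1, steps=2000, seed=0):
    """Online K-means sketch following steps 1-5 above.
    eta is held fixed for simplicity; a decaying schedule is common."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize K prototypes (here: random input patterns)
    m = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(steps):
        x = X[rng.integers(len(X))]                   # step 2: present a pattern
        J = np.argmin(np.linalg.norm(m - x, axis=1))  # step 3: winner by distance
        m[J] += eta * (x - m[J])                      # step 4: move winner toward x
    return m
```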
7
K-means Algorithm
iterate {
    compute the distance from all points to all k centers
    assign each point to the nearest k-center
    compute the average of all points assigned to each k-center
    replace the k-centers with the new averages
}
From Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet
Summer 2007, Distributed Computing Seminar, p 12
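The batch pseudocode above translates directly into NumPy (a minimal sketch; the guard for empty clusters is an added assumption):

```python
import numpy as np

def kmeans(X, centers, iters=20):
    """Batch K-means matching the pseudocode: assign every point to its
    nearest center, then replace each center with the mean of its points."""
    c = np.asarray(centers, dtype=float).copy()
    for _ in range(iters):
        # distance from all points to all k centers
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        assign = d.argmin(axis=1)            # nearest k-center per point
        for j in range(len(c)):
            if np.any(assign == j):          # keep empty centers unchanged
                c[j] = X[assign == j].mean(axis=0)
    return c, assign
```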
8
Disadvantages of K-means
• The K-means algorithm requires the
number of clusters to be determined in
advance; it must be estimated via a
procedure of cluster analysis. An
inappropriate choice of the number of
clusters may distort the real clustering
structure, which is why the leader-follower
algorithm is needed.
9
Disadvantages of K-means
• η, the learning rate, becomes very small in the
last stages, which has the disadvantage that new
patterns are no longer learned well.
• The learning rate is therefore decayed over the run,
where η0 and η1 are the initial and final values
of the learning rate, respectively, and t1 is the
maximum number of iterations allowed.
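The decay schedule itself is not shown on the slide; one commonly used exponential form consistent with the symbols η0, η1, and t1 (an assumption, not necessarily the authors' exact choice) is:

```python
def eta(t, eta0=0.5, eta1=0.01, t1=1000):
    """Learning rate decaying exponentially from eta0 at t=0 to eta1 at t=t1.
    The default values are illustrative only."""
    return eta0 * (eta1 / eta0) ** (t / t1)
```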
10
Leader - Follower Clustering
Algorithm
1. Initialize the first cluster prototype vector m1 with the first input
pattern;
2. Present a normalized input pattern x;
3. Choose the winner J that is closest to x based on the Euclidean
distance,
J = argmin_j ||x − mj||;
4. If ||x − mJ|| < θ, update the winning prototype vector,
mJ(new) = mJ(old) + η(x − mJ(old)),
where η is the learning rate. Otherwise, create a new cluster with the
prototype vector equal to x;
5. Repeat steps 2-4 until the maximum number of steps is reached.
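The algorithm above in sketch form (illustrative only; one pass over the data, with the threshold θ and learning rate η passed in):

```python
import numpy as np

def leader_follower(X, theta, eta=0.2):
    """Leader-follower sketch: follow a close-enough leader, else found
    a new cluster whose prototype is the pattern itself."""
    m = [np.asarray(X[0], dtype=float)]           # step 1: first pattern leads
    for x in X[1:]:
        d = [np.linalg.norm(x - mj) for mj in m]
        J = int(np.argmin(d))                     # step 3: closest prototype
        if d[J] < theta:
            m[J] += eta * (x - m[J])              # step 4: follow the leader
        else:
            m.append(np.asarray(x, dtype=float))  # step 4: new cluster
    return m
```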
11
Leader - Follower
Distance > Threshold
• Find the closest cluster center
– Distance above threshold ? Create new cluster
– Or else, add instance to cluster and update cluster
center
From Johan Everts, Clustering algorithms, Kunstmatige Intelligentie, p 31
12
Performance Analysis
• K-Means
– Depends a lot on a priori knowledge (K)
– Very Stable
• Leader Follower
– Depends a lot on a priori knowledge
(Threshold)
– Faster but unstable
From Johan Everts, Clustering algorithms, Kunstmatige Intelligentie, p 39
13
Adaptive Resonance Theory
• An important problem with competitive learning-based
clustering is stability. The stability of an incremental
clustering algorithm can be stated in terms of two conditions:
• (1) No prototype vector can cycle, i.e., take on a value that
it had at a previous time (provided it has changed in the
meantime).
• (2) Only a finite number of clusters are formed with
infinite presentation of the data. The first condition
considers the stability of the individual prototype vectors of
the clusters, and the second concentrates on the
stability of all the cluster vectors.
14
Adaptive Resonance Theory
• The K-means and leader-follower algorithms
do not produce stable clusters. The plasticity of
the two algorithms may cause loss of previously
learned rules.
• Adaptive resonance theory (ART) was
developed by Carpenter and Grossberg (1987a,
1988).
• ART is not, as is popularly imagined, a neural
network architecture. It is a learning theory
hypothesizing that resonance in neural circuits
can trigger fast learning.
15
Adaptive Resonance Theory
• Stability-Plasticity Dilemma
• Stability: system behaviour doesn’t change
after irrelevant events
• Plasticity: System adapts its behaviour
according to significant events
• Dilemma: how to achieve stability without
rigidity and plasticity without chaos?
– Ongoing learning capability
– Preservation of learned knowledge
From: Arash Ashari, Ali Mohammadi, ART powerpoint
16
ART-1
• The basic ART1 architecture consists of
two layers of nodes or neurons: the feature
representation field F1 and the category
representation field F2.
• The neurons in layer F1 are activated by
the input pattern, while the prototypes of
the formed clusters are stored in layer F2.
17
ART-1 Architecture
18
ART-1
• The two layers are connected via adaptive
weights: a bottom-up weight matrix and a top-down weight matrix.
• F2 performs a winner-take-all competition
between a certain number of committed neurons
and one uncommitted neuron. The winning
neuron feeds back its template weights to layer
F1. This is known as top-down feedback
expectancy. The template is compared with the
input pattern.
19
ART-1
• If the match meets the vigilance criterion, weight
adaptation occurs, and both the bottom-up and top-down
weights are updated simultaneously. This
procedure is called resonance, which suggests the name
ART. On the other hand, if the vigilance criterion is not
met, a reset signal is sent back to layer F2 to shut off the
current winning neuron.
• The expectation of the next winning neuron is then projected into layer F1,
and this process repeats until the vigilance criterion is
met. If an uncommitted neuron is selected for coding, a
new uncommitted neuron is created to represent a
potential new cluster. It is clear that the vigilance
parameter ρ has a function similar to that of the
threshold parameter θ of the leader-follower algorithm.
20
ART-1 Flowchart
21
Fuzzy ART
• FA maintains an architecture and operations similar
to ART1 while using fuzzy set operators in place of
the binary operators, so that it can work
for real-valued data sets. We describe FA by
emphasizing its main differences from ART1 in
terms of the following five phases, known as
preprocessing, initialization, category choice,
category match, and learning.
• Preprocessing. Each component of a d-dimensional input pattern x = (x1, …, xd) must
be in the interval [0, 1].
22
Fuzzy ART
• Initialization. The real-valued adaptive
weights W = {wij}, representing the
connection from the ith neuron in layer F2
to the jth neuron in layer F1, include both
the bottom-up and top-down weights of
ART1. Initially, the weights of an
uncommitted node are set to one. Larger
values may also be used; however, this
biases the system towards selecting
committed nodes.
23
Fuzzy ART
• Category choice. After an input pattern is
presented, the nodes in layer F2 compete by
calculating the category choice function, defined
as
Tj = |x ∧ wj| / (α + |wj|),
where ∧ is the fuzzy AND operator defined by
(x ∧ y)i = min(xi, yi), and |·| is the sum of the components.
24
Fuzzy ART
• Category match. The category match function
of the winning neuron J is then tested against the
vigilance criterion. If
|x ∧ wJ| / |x| ≥ ρ,
resonance occurs. Otherwise, the current
winning neuron is disabled and a new neuron in
layer F2 is selected and examined with the
vigilance criterion. This search process
continues until the criterion above is satisfied.
25
Fuzzy ART
• Learning. The weight vector of the winning
neuron that passes the vigilance test is
updated using the following learning rule,
wJ(new) = β(x ∧ wJ(old)) + (1 − β)wJ(old),
where β ∈ [0, 1] is the learning rate parameter.
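The three Fuzzy ART functions (category choice, category match, learning) can be sketched together as follows; the parameter defaults are illustrative, and the surrounding search loop over F2 nodes is omitted:

```python
import numpy as np

def fuzzy_and(x, w):
    return np.minimum(x, w)                 # (x ^ y)_i = min(x_i, y_i)

def choice(x, w, alpha=0.001):
    """Category choice function T_j = |x ^ w_j| / (alpha + |w_j|)."""
    return fuzzy_and(x, w).sum() / (alpha + w.sum())

def match(x, w):
    """Category match |x ^ w_J| / |x|, compared against the vigilance rho."""
    return fuzzy_and(x, w).sum() / x.sum()

def learn(x, w, beta=1.0):
    """Learning rule: w_new = beta*(x ^ w_old) + (1 - beta)*w_old."""
    return beta * fuzzy_and(x, w) + (1.0 - beta) * w
```

With β = 1 (fast learning) the winning weight vector is clamped directly to x ∧ w, as the test below shows for an uncommitted node whose weights start at one.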
26
SOFT COMPETITIVE LEARNING
CLUSTERING
• Leaky Learning
• One of the major problems with hard competitive learning is the
underutilized or dead neuron problem, which refers to the possibility that the
weight vector of a neuron is initialized farther away from any input pattern
than the other weight vectors, so that it never has the opportunity to win the
competition and, therefore, is never trained. One solution to
this problem is to allow both winning and losing neurons to
move towards the presented input pattern, but with different learning rates:
• wJ(t+1) = wJ(t) + ηw(x − wJ(t)) for the winning neuron J,
wj(t+1) = wj(t) + ηl(x − wj(t)) for the losing neurons j ≠ J,
where ηw and ηl are the learning rates for the winning and losing neurons,
respectively, and ηw >> ηl.
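A minimal sketch of one leaky-learning update (the learning-rate values are illustrative):

```python
import numpy as np

def leaky_update(W, x, eta_w=0.1, eta_l=0.001):
    """Leaky learning: every neuron moves toward x, but the winner moves
    much faster (eta_w >> eta_l), so losers only 'leak' slowly and
    cannot stay dead forever."""
    J = np.argmin(np.linalg.norm(W - x, axis=1))  # winning neuron
    out = W + eta_l * (x - W)                     # losers leak toward x
    out[J] = W[J] + eta_w * (x - W[J])            # winner takes the full step
    return out
```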
27
• Conscience Mechanism
To give every neuron a chance to win, the distance definition described
above must be modified. DeSieno (1988) adds a bias term bj to the
squared Euclidean distance.
x: data set
wj, j = 1, 2, …, K: neuron weights
bj: bias term
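A sketch of the conscience mechanism; the slide omits the exact bias update, so the win-frequency form below follows DeSieno (1988), with illustrative values for the constants gamma and B:

```python
import numpy as np

class ConscienceWTA:
    """DeSieno-style conscience: b_j = gamma*(1/K - p_j) handicaps
    neurons that win too often (p_j tracks each neuron's win frequency)."""
    def __init__(self, W, gamma=10.0, B=1e-4):
        self.W = np.asarray(W, dtype=float)
        self.gamma, self.B = gamma, B
        self.p = np.full(len(self.W), 1.0 / len(self.W))
    def winner(self, x):
        d2 = ((self.W - x) ** 2).sum(axis=1)           # squared distance
        b = self.gamma * (1.0 / len(self.W) - self.p)  # bias term b_j
        J = int(np.argmin(d2 - b))                     # biased competition
        won = (np.arange(len(self.W)) == J).astype(float)
        self.p += self.B * (won - self.p)              # update win frequencies
        return J
```

While the win frequencies are uniform, all biases are zero and the competition reduces to plain nearest-prototype WTA.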
28
• Rival Penalized Competitive Learning
• x: data set
• wj, j = 1, 2, …, K: neuron weights
• γj: bias term
29
Learning Vector Quantization
• Learning vector quantization (LVQ)
(Kohonen, 1990) is a supervised
pattern classification method, with essentially the same
network structure as Kohonen's SOM. The LVQ algorithm finds the
output unit that is closest to the input vector. If x
and the winning weight vector belong to the same class, we move the
weights toward the input vector; if they belong to
different classes, we move the weights
away from this input vector. (Fundamentals of
Neural Networks, L. Fausett)
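A single LVQ1 update step as described (a sketch; the learning rate and the prototype/label layout are assumptions):

```python
import numpy as np

def lvq1_step(W, labels, x, y, eta=0.1):
    """One LVQ1 update: attract the winning prototype toward x if its
    class label matches y, otherwise repel it away from x."""
    J = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    sign = 1.0 if labels[J] == y else -1.0
    W[J] += sign * eta * (x - W[J])
    return J
```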
30
Flowchart
of LVQ
X: input pattern
J(w,x) : cost function
w: weights
31
LVQ
J is the winning neuron; the cost function is defined on the locally
weighted error between x and w.
32
LVQ
θ: prespecified threshold
33
LVQ Application
Ten data points clustered into two clusters, shown in red and cyan
34
SOM
• A competitive network: the output neurons
of the network compete among themselves
to be activated or fired. The neighborhood
is usually decreased over time, and common
neighborhood shapes are linear,
rectangular, or hexagonal.
35
Neural Networks: A Comprehensive Foundation, Simon Haykin, Prentice Hall, p 467
36
SOM Neighborhood
Application of neural Network and other Learning
Technologies in Process Engineering, I.M. Majtaba,
M.A. Hussain, Imperial College Press, 2001, P 53
37
SOM BMU
Best matching unit: update the
weights of the winner and its neighbours.
Decrease the learning rate and
neighbourhood size.
38
Flowchart of SOFM
39
Basic steps of SOFM
• 1. Determine the topology of the SOFM. Initialize the
weight vectors wj(0) for j = 1, …, K randomly;
• 2. Present an input pattern x to the network. Choose the
winning node J that has the minimum Euclidean distance
to x, i.e.
J = argmin_j ||x − wj||;
• 3. Calculate the current learning rate and size of the
neighborhood;
• 4. Update the weight vectors of all the neurons in the
neighborhood of J using wj(t+1) = wj(t) + η(t)hjJ(t)(x − wj(t));
• 5. Repeat steps 2 to 4 until the change of neuron
position falls below a prespecified small positive number.
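Steps 1-5 in sketch form; a Gaussian neighborhood hjJ on a rectangular grid and linear decay schedules for η and the neighborhood width are illustrative assumptions, and training stops after a fixed number of iterations rather than on the convergence test of step 5:

```python
import numpy as np

def train_sofm(X, rows=4, cols=4, iters=2000, eta0=0.5, sigma0=2.0, seed=0):
    """SOFM sketch: every neuron in the grid moves toward x, weighted by
    a Gaussian of its grid distance to the winner J."""
    rng = np.random.default_rng(seed)
    W = rng.random((rows * cols, X.shape[1]))              # step 1: init
    pos = np.array([[i, j] for i in range(rows) for j in range(cols)], float)
    for t in range(iters):
        x = X[rng.integers(len(X))]                        # step 2: pattern
        J = np.argmin(np.linalg.norm(W - x, axis=1))       # step 2: winner
        eta = eta0 * (1.0 - t / iters)                     # step 3: decay
        sigma = max(sigma0 * (1.0 - t / iters), 0.3)
        h = np.exp(-((pos - pos[J]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        W += eta * h[:, None] * (x - W)                    # step 4: update all
    return W
```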
40
SOM Application
Learning a character
41
SOM Application
Learning circle with SOM
42
SOM Application
SOM Examples from Bernd Fritzke, Ruhr University, Draft 5 April 1997, p 32
43
Neural Gas
NG adaptively determines the neighborhood
update by using a neighborhood ranking of
the prototype vectors within the input
space, rather than a neighborhood function
in the output lattice.
44
Neural Gas
• hλ(kj(x,W)) is a bell-shaped curve,
hλ(kj(x,W)) = exp(−kj(x,W)/λ),
where kj(x,W) is the distance rank of prototype wj with respect to x.
• Prototype vectors are updated as
wj(t+1) = wj(t) + η(t)hλ(kj(x,W))(x − wj(t)).
• Learning rate η and characteristic decay constant λ:
• η0 and ηf: initial and final values
• λ0 and λf: initial and final decay constants
• T: maximum number of iterations
45
NG Algorithm
The major process of the NG algorithm is as follows:
1. Initialize a set of prototype vectors W = {w1, w2, …, wK} randomly;
2. Present an input pattern x to the network. Sort the index
list in order from the prototype vector with the smallest
Euclidean distance from x to the one with the greatest
distance from x;
3. Calculate the current learning rate and hλ(kj(x,W))
(bell-shaped curve). Adjust the prototype vectors using
the learning rule;
4. Repeat steps 2 and 3 until the maximum number of
iterations is reached.
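One NG update following steps 2-3 (a sketch; the rank-based neighborhood hλ(kj) = exp(−kj/λ) is taken from the preceding slide, and the parameter values are illustrative):

```python
import numpy as np

def neural_gas_step(W, x, eta=0.1, lam=1.0):
    """One NG update: every prototype moves toward x by an amount that
    decays with its distance rank k_j (0 = closest)."""
    d = np.linalg.norm(W - x, axis=1)
    k = np.argsort(np.argsort(d))          # rank of each prototype
    h = np.exp(-k / lam)                   # bell-shaped neighborhood in ranks
    return W + eta * h[:, None] * (x - W)
```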
46
NG Application
NG keeps adding new centers and stops when the maximum number of iterations is reached
47
NG Application
NG Examples from Bernd Fritzke, Ruhr University, Draft 5 April 1997, p 22
48
Growing Neural Gas
• A type of SOM. Neural gas is a simple
algorithm for finding optimal data
representations based on feature vectors.
The algorithm was coined "neural gas"
because of the dynamics of the feature
vectors during the adaptation process:
they distribute themselves like a gas
within the data space.
49
Growing Neural Gas
• When prototype learning occurs, not only is the
prototype vector of the winning neuron J1 updated
towards x, but the prototypes within its topological
neighborhood NJ1 are also adapted.
• Unlike NG, GCS, or SOFM, GNG is developed as
a self-organizing network that can dynamically insert
(usually) and remove neurons. A new neuron is inserted into
the network every λ iterations near the neuron with the
maximum accumulated error. At the same time, a neuron
removal rule can also be used to eliminate the neurons
with the lowest utility for error reduction.
50
GNG
GNG Examples from Bernd Fritzke, Ruhr University, Draft 5 April 1997, p 29
58
Some Applications
Magnetic Resonance Imaging Segmentation
MRI provides a visualization of the internal
tissues and organs of a living organism, which
is valuable in its applications in disease
diagnosis (such as cancer and heart and
vascular disease), treatment, and surgical
planning. MRI segmentation can be formulated
as a clustering problem in which a set of feature
vectors, obtained by transforming image
measurements and positions, is grouped into a
relatively small number of clusters.
59
Magnetic Resonance
Imaging Segmentation
• After the patient was
given Gadolinium, the
tumor on the T1-weighted image (Fig.
5.17(d)) becomes
very bright and is
isolated from the
surrounding tissue.
From N. Karayiannis and P. Pai. Segmentation of magnetic resonance images
using fuzzy algorithms for learning vector quantization. IEEE Transactions on
Medical Imaging, vol. 18, pp. 172 – 180, 1999. Copyright © 1999 IEEE.)
60
Condition Monitoring of 3G
Cellular Networks
• The 3G mobile networks combine new technologies
such as WCDMA and UMTS and provide users with a
wide range of multimedia services and applications with
higher data rates (Laiho et al., 2005 ). At the same time,
emerging new requirements make it more important to
monitor the states and conditions of 3G cellular
networks. Specifically, in order to detect abnormal
behaviors in 3G cellular systems, four competitive
learning neural networks, LVQ, FSCL, SOFM (see
another application of SOFM in WCDMA network
analysis in Laiho et al. (2005) ), and NG, were applied to
generate abstractions or clustering prototypes of the
input vectors under normal conditions, which are further
used for network behavior prediction
61
Condition Monitoring of 3G
Cellular Networks
The clustering prototypes provide a good summary
of the normal behaviors of the cellular networks,
which can then be used to detect abnormalities.
62
Summary
Neural network – based clustering is tightly related to the
concept of competitive learning. Prototype vectors,
associated with a set of neurons in the network and
representing clusters in the feature or output space,
compete with each other upon the presentation of an
input pattern. The active neuron or winner reinforces
itself (hard competitive learning) or its neighborhood
within certain regions (soft competitive learning). Most
often, the neighborhood decreases monotonically with
time.
One important problem that learning algorithms need to
deal with is the stability and plasticity dilemma. A system
should have the capability of learning new and important
patterns while maintaining stable cluster structures in
response to irrelevant inputs.
63