Chapter 6
Neural Network Implementations

Neural Network Implementations
Back-propagation networks
Learning vector quantizer networks
Kohonen self-organizing feature map networks
Evolutionary multi-layer perceptron networks
The Iris Data Set
Consists of 150 four-dimensional vectors (50 plants of each of three
Iris species): x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}), i = 1, ..., 150
Features are: sepal length, sepal width, petal length and petal width
We are working with scaled values in the range [0,1]
Examples of patterns:
Input vectors (four scaled features each):
0.637500 0.437500 0.875000 0.400000
0.787500 0.412500 0.175000 0.587500
0.750000 0.025000 0.175000 0.312500
Target vectors (one per class):
1 0 0
0 1 0
0 0 1
Implementation Issues
•Topology
•Network initialization and normalization
•Feedforward calculations
•Supervised adaptation versus unsupervised adaptation
•Issues in evolving neural networks
Topology
•Pattern of PEs and interconnections
•Direction of data flow
•PE activation functions
Back-propagation uses at least three layers; LVQ and SOFM use two.
Definition: Neural Network
Architecture
Specifications sufficient to build, train, test, and
operate a neural network
Back-propagation Networks
•Software on web site
•Topology
•Network input
•Feedforward calculations
•Training
•Choosing network parameters
•Running the implementation
Elements of an artificial neuron
(PE)
•Set of connection weights
•Linear combiner
•Activation function
Back-propagation Network Structure
Back-propagation network input
•Number of inputs depends on application
•Don’t combine parameters unnecessarily
•Inputs usually over range [0,1], continuous valued
•Type float in C++ (32 bits): 24-bit significand (23 stored), 8-bit exponent, 1 sign bit; ~7 decimal digits
•Scaling usually used as a preprocessing tool
•Usually scale on like groups of channels
•Amplitude
•Time
Feedforward Calculations
•Input PEs distribute signal forward along multiple paths
•Fully connected, in general
•No feedback loop, not even self-feedback
•Additive sigmoid PE is used in our implementation
Activation of the ith hidden PE for pattern k:
y_{ki} = f_n(\sum_{h=0}^{n} x_{kh} v_{ih})
where f_n(·) is the sigmoid function, and index 0 is the bias PE
Sigmoid Activation Function
output = 1/(1 + e^{-input})
Feedforward calculations, cont’d.
•Sigmoid function performs a job similar to an electronic amplifier (gain is the slope)
•Once hidden-layer activations are calculated, outputs are calculated:
z_{kj} = f_n(\sum_{i=0}^{h} y_{ki} w_{ji})
where f_n(·) is the sigmoid function, and index 0 is the hidden-layer bias PE
Training by Error Back-propagation
Error per pattern:
E_k = 0.5 \sum_{j=1}^{q} (b_{kj} - z_{kj})^2
Error signal for output PE j:
\delta_{kj} = f'(r_{kj})(b_{kj} - z_{kj}) = z_{kj}(1 - z_{kj})(b_{kj} - z_{kj})
We derived this using the chain rule.
Backpropagation Training, Cont’d.
•But weights must be initialized before they can be updated.
•They are usually randomized over [-0.3, 0.3]
•Two ways to update weights:
Online, or “single pattern,” adaptation
Off-line, or epoch, adaptation (we use this in our back-prop)
Updating Output Weights
Basic weight update method:
w_{ji}^{new} = w_{ji}^{old} + \eta \sum_k \delta_{kj} y_{ki}
But this tends to get caught in local minima.
So, introduce “momentum” \alpha \in [0,1]:
w_{ji}^{new} = w_{ji}^{old} + \eta \sum_k \delta_{kj} y_{ki} + \alpha \Delta w_{ji}^{old}
(includes bias weights)
Updating Hidden Weights
As derived previously:
\delta_{ki} = y_{ki}(1 - y_{ki}) \sum_{j=1}^{q} \delta_{kj} w_{ji}
So,
v_{ih}^{new} = v_{ih}^{old} + \eta \sum_k \delta_{ki} x_{kh} + \alpha \Delta v_{ih}^{old}
Note: δ’s are calculated one pattern at a time, and are calculated
using “old” weights.
Keep in mind…
In offline training: The deltas are calculated pattern by pattern, while
the weights are updated once per epoch.
The values for η and α are usually assigned to the entire network,
and left constant after good values are found.
When the δ’s are calculated for the hidden layer, the old (existing)
weights are used.
Kohonen Networks
Probably second only to backpropagation in number
of applications
Rigorous mathematical derivation has not occurred
Seem to be more biologically oriented than most paradigms
Reduce dimensionality of inputs
We’ll consider LVQI, LVQII, and Self-Organizing Feature Maps
Initial Weight Settings
1. Randomize weights over [0,1].
2. Normalize weights:
w_{ji}^{norm} = w_{ji}^{random} / \sqrt{\sum_{i=1}^{p} (w_{ji}^{random})^2}
•Note: Randomization often occurs in the centroid area
of the problem space.
Preprocessing Alternatives
1. Transform each variable onto [-1,1]
2. Then normalize by:
a. Dividing each vector component by the total length:
a_{ki} \leftarrow a_{ki} / l_k, where l_k = \sqrt{\sum_h a_{kh}^2}
or by
b. “Z-axis normalization” with a “synthetic” variable:
f = 1/\sqrt{n},  a_{ik} \leftarrow f a_{ik},  s_k = f \sqrt{n - l_k^2}
or by
c. Assigning a fixed interval (perhaps 0.1 or 1/n, whichever
is smaller) to a synthetic variable that is the scale factor
in a. scaled to the fixed interval
Euclidean Distance
d_{jk}(t) = \sqrt{\sum_{i=1}^{p} [a_{ki}(t) - w_{ji}(t)]^2}
for the jth PE and the kth pattern
Distance Measures
d_j^{(l)} = (\sum_{i=1}^{n} |a_{ki} - w_{ji}|^l)^{1/l}
l = 1: Hamming distance
l = 2: Euclidean distance
l = 3: ???
Weight Updating
Weights are adjusted in the neighborhood only:
w_{ji}(t+1) = w_{ji}(t) + \eta(t)[a_{ki}(t) - w_{ji}(t)]  for j \in N(t)
Sometimes \eta(t) = 0.2(1 - t/z), where z = total number of iterations
Rule of thumb: The number of training iterations should be about 500 times
the number of output PEs.
* Some people start out with eta = 1 or near 1.
* The initial neighborhood should include most or all
of the output PE field.
* Options exist for the configuration of the output slab: ring,
cylindrical surface, cube, etc.
Error Measurement
*Unsupervised, so no “right” or “wrong”
*Two approaches – pick or mix
* Define error as mean error vector length
* Define error as max error vector length (adding PE when
this is large could improve performance)
* Convergence metric:
max_error_vector_length/eta
(best when epoch training is used)
Learning Vector Quantizers: Outline
•Introduction
•Topology
•Network initialization and input
•Unsupervised training calculations
•Giving the network a conscience
•LVQII
•The LVQI implementation
Learning Vector Quantization:
Introduction
•Related to SOFM
•Several versions exist, both supervised and unsupervised
•LVQI is unsupervised; LVQII is supervised (I & II do not correspond
to Kohonen’s notation)
•Related to perceptrons and the delta rule; however:
* Only one (winner) PE’s weights are updated
* Depending on the version, updating is done for correct and/or
incorrect classification
* The weight-updating method is analogous to the metric used to
pick the winning PE for updating
* Network weight vectors approximate the density function
of the input
LVQ-I Network Topology
LVQI Network Initialization and
Input
•LVQI clusters input data
•More common to input raw data (preprocessed)
•Usually normalize input vectors, but sometimes better not to
•Initial normalization of weight vectors almost always done, but in
various ways
•In the implementation, for p PEs in the output layer, the first p patterns are
chosen randomly to initialize the weights
Weight and Input Vector Initialization
(a) before, (b) after, input vector normalization
LVQ Version I - Unsupervised
Training
•Present one pattern at a time, and select the winning output PE based on
minimum Euclidean distance
•Update weights:
w_{ji}^{new} = w_{ji}^{old} + \eta(t)(a_{ki} - w_{ji}^{old})  for the winner only
w_{ji}^{new} = w_{ji}^{old}  for all others
•Continue until weight changes are acceptably small or max. iterations
occur
•Ideally, the output will reflect the probability distribution of the input
•But, what if we want to more accurately characterize the
decision hypersurface?
•Important to have training patterns near the decision hypersurface
Giving the Network a Conscience
•The optimal 1/n representation by each output PE is unlikely
(without some “help”)
•This is especially serious when initial weights don’t reflect the
probability distribution of the input patterns
•DeSieno developed a method for adding a conscience to the
network
In example: With no conscience, given uniform
distribution of input patterns, w7 will win about
half of the time, other weights about 1/12 of
the time each.
Conscience Equations
(a) y_j = 1 for the PE with minimum [d_j - b_j];  y_j = 0 for all other PEs
(b) b_j = \gamma(1/n - f_j)
(c) f_j^{new} = f_j^{old} + \beta(y_j - f_j^{old})
Conscience Parameters
•Conscience factor f_j with initial value = 1/n
(so initial bias values are all 0)
•Bias factor γ set to approximately 10
•Constant β set to about 0.0001
(set β so that conscience factors don’t reflect noise in the data)
Example of Conscience
If there are 5 output PEs, then 1/n = 0.2 = all initial f_j values
Biases are 0 initially, and the first winner is selected based on minimum
Euclidean distance
Conscience factors are now updated:
Winner’s f_j = 0.2 + 0.0001(1.0 - 0.2) = 0.20008
All others’ f_j = 0.2 - 0.00002 = 0.19998
Winner’s b_j = -0.0008; all others’ b_j = 0.0002
Probability Density Function
Shows regions of equal area
Learning: No Conscience
A = 0.03 for 16,000 iterations
Learning: With Conscience
A = 0.03 for 16,000 iterations
With Conscience, Better Weight Allocation
LVQ - Version II - Supervised
* Instantiate the first p a_k vectors to weights w_{ji}
* Relative numbers of weights assigned by class must
correspond to the a priori probabilities of the classes
* Assume pattern A_k belongs to class C_r and that the winning
PE’s weight vector belongs to class C_s; then for the winning PE:
w_{ji}^{new} = w_{ji}^{old} + \eta(t)(a_{ki} - w_{ji}^{old})  if C_r = C_s
w_{ji}^{new} = w_{ji}^{old} - \eta(t)(a_{ki} - w_{ji}^{old})  if C_r ≠ C_s
For all other PEs, no weight changes are made.
* This LVQ version reduces misclassifications
Evolving Neural Networks: Outline
•Introduction and definitions
•Artificial neural networks
•Adaptation and computational intelligence
•Advantages and disadvantages of previous approaches
•Using particle swarm optimization (PSO)
•An example application
•Conclusions
Introduction
•Neural networks are very good at some problems, such as mapping
input vectors to outputs
•Evolutionary algorithms are very good at other problems, such as
optimization
•Hybrid tools are possible that are better than either approach by itself
•Review articles on evolving neural networks: Schaffer, Whitley, and
Eshelman (1992); Yao (1995); and Fogel (1998)
•Evolutionary algorithms usually used to evolve network weights, but
sometimes used to evolve structures and/or learning algorithms
Typical Neural Network
OUTPUTS
INPUTS
More Complex Neural Network
Evolutionary Algorithms (EAs)
Applied to Neural
Network Attributes
•Network connection weights
•Network topology (structure)
•Network PE transfer function
•Network learning algorithms
Early Approaches to Evolve Weights
•Bremermann (1968) suggested optimizing weights in multilayer
neural networks.
•Whitley (1989) used a GA to learn weights in a feedforward network;
used for relatively small problems.
•Montana and Davis (1989) used a “steady state” GA to train a
500-weight neural network.
•Schaffer (1990) evolved a neural network with better generalization
performance than one designed by a human.
Evolution of Network Architecture
•Most work has focused on evolving network topological structure
•Less has been done on evolving processing element (PE) transfer
functions
•Very little has been done on evolving topological structure and PE
transfer functions simultaneously
Examples of Approaches
•Indirect coding schemes
Evolve parameters that specify network topology
Evolve number of PEs and/or number of hidden layers
•Evolve developmental rules to construct network topology
•Stork et al. (1990) evolved both network topology and PE transfer
functions (Hodgkin-Huxley equation) for neuron in tail-flip circuitry of
crayfish (only 7 PEs)
•Koza and Rice (1991) used genetic programming to find weights and
topology. They encoded a tree structure of Lisp S-expressions in the
chromosome.
Examples of Approaches, Cont’d.
•Optimization of EA operators used to evolve neural networks
(optimize hill-climbing capabilities of GAs)
•Summary:
•Few quantitative comparisons with other approaches typically
given (speed of computation, performance, generalization,
etc.)
•Comparisons should be between best available approaches
(fast EAs versus fast NNs, for example)
Advantages of Previous Approaches
•EAs can be used to train neural networks with non-differentiable PE
transfer functions.
•Not all PE transfer functions in a network need to be the same.
•EAs can be used when error gradient or other error information is
not available.
•EAs can perform a global search in a problem space.
•The fitness of a network evolved by an EA can be defined in a way
appropriate for the problem. (The fitness function does not have to
be continuous or differentiable.)
Disadvantages of Previous
Approaches
•GAs do not generally seem to be better than best gradient methods such
as quickprop in training weights
•Evolution of network topology is often done in ways that result in
discontinuities in the search space (e.g., removing and inserting connections
and PEs). Networks must therefore be retrained, which is computationally
intensive.
•Representation of weights in a chromosome is difficult.
•Order of weights?
•Encoding method?
•Custom designed genetic operators?
Disadvantages of Previous
Approaches, Cont’d.
Permutation problem (also known as the competing conventions
problem or isomorphism problem): Multiple chromosome
configurations can represent equivalent optimum solutions.
Example: various permutations of hidden PEs can represent
equivalent networks.
We believe, as does Hancock (1992), that this problem is not as
severe as reported. (In fact, it may be an advantage.)
Evolving Neural Networks with
Particle Swarm Optimization
•Evolve neural network capable of being universal approximator, such as
backpropagation or radial basis function network.
•In backpropagation, the most common PE transfer function is the sigmoidal function:
output = 1/(1 + e^{-input})
•Eberhart, Dobbins, and Simpson (1996) first used PSO to evolve network
weights (replaced backpropagation learning algorithm)
•PSO can also be used to indirectly evolve the structure of a network. An
added benefit is that the preprocessing of input data is made unnecessary.
Evolving Neural Networks with Particle
Swarm Optimization, Cont’d.
•Evolve both the network weights and the slopes of sigmoidal transfer
functions of hidden and output PEs.
•If the transfer function now is: output = 1/(1 + e^{-k*input}), then we are
evolving k in addition to evolving the weights.
•The method is general, and can be applied to other topologies and other
transfer functions.
•Flexibility is gained by allowing slopes to be positive or negative. A
change in sign for the slope is equivalent to a change in signs of all input
weights.
Evolving the Network Structure
with PSO
•If evolved slope is sufficiently small, sigmoidal output can be clamped to
0.5, and hidden PE can be removed. Weights from bias PE to each PE in
next layer are increased by one-half the value of the weight from the PE
being removed to the next-layer PE. PEs are thus pruned, reducing
network complexity.
•If evolved slope is sufficiently high, sigmoid transfer function can be
replaced by step transfer function. This works with large negative or
positive slopes. Network computational complexity is thus reduced.
Evolving the Network Structure
with PSO, Cont’d.
•Since slopes can evolve to large values, input normalization is generally
not needed. This simplifies applications process and shortens
development time.
•The PSO process is continuous, so neural network evolution is also
continuous. No sudden discontinuities exist such as those that plague
other approaches.
Example Application: the Iris Data
Set
•Introduced by Anderson (1935), popularized by Fisher (1936)
•150 records total; 50 of each of 3 varieties of iris flowers
•Four attributes in each record
•sepal length
•sepal width
•petal length
•petal width
•We used both normalized and unnormalized versions of the data
set; all 150 patterns were used to evolve a neural network. Issue of
generalization was thus not addressed.
Example Application, Continued
•Values of -k*input > 100 resulted in clamping PE transfer output
to zero, to avoid computational overflow.
•Normalized version of data set first used to test concept of
evolving both weights and slopes. Next we looked at threshold
value for slope at which the sigmoidal transfer function could be
transitioned into a step function without significant loss in
performance.
Performance Variations with Slope
Thresholds
Discussion of Example Application
•Average number of errors was 2.15 out of 150 with no slope
threshold. (This is a good result for this data set.)
•Accuracy degrades gracefully until slope threshold decreases to 4.
•Preliminary indication is that slopes can be evolved, and that a slope
threshold of about 10 to 20 would be reasonable for this problem.
•Other data sets are being examined.
•More situations with slopes near zero are being tested.
Un-normalized Data Set Results
One set of runs; 40 runs of 1000 generations

Number correct:           149  148  147  146  145  144  100   99
Number of runs with
this number correct:       11   16    6    3    1    1    1    1

Good solution obtained in 38 of 40 runs. Average number correct was
145.45. Ignoring the two worst solutions, an average of only 2 mistakes.
Examples of Recent Applications
•Scheduling (Integrated automated container terminal)
•Manufacturing (Product content combination optimization)
•Figure of merit for electric vehicle battery pack
•Optimizing reactive power and voltage control
•Medical analysis/diagnosis (Parkinson’s disease and essential tremor)
•Human performance prediction (cognitive and physical)
Conclusions
•Brief review of applying EC techniques to evolving neural networks
was presented. Advantages and disadvantages were summarized.
•A new methodology for using particle swarm optimization to
evolve network weights and structures was presented.
•The methodology seems to overcome the first four disadvantages
discussed.
•We believe that multimodality is a help rather than a hindrance
with EAs (including PSO).
•The Iris Data Set was used as an example of the new approach.
The BP Software
An implementation of a fully-connected feed-forward network.
main() routine
BP_Start_Up() reads parameters from the input (run) file
and allocates memory
BP_Clean_Up() stores results in the output file and deallocates memory
bp_state_handler() is the most important part of the BP state
machine
Output PEs can be linear or sigmoid; hidden PEs are always sigmoid.
The number of layers and the number of PEs per layer can be specified.
Back-prop. State Transition Diagram
BP Software, Cont’d.
Enumeration data types used for:
•NN operating mode (train or recall)
•PE function type
•Nature of the layer (input, hidden, output)
•Training mode (offline or online)
•States in the state machine
Enumeration Data Types for All NNs
Enumeration Data Types for Back-prop.
BP Software, Cont’d.
Structure data types used for:
•PE configuration
•Network configuration
•Environment and training parameters
•Network architecture
•Pattern configuration
Structure Data Type Example
Structure data type BP_Arch_Type defines the network
architecture:
Number of layers
Pointer to layers
Pointer to number of PEs in hidden layers
BP State Handler
•Total of 15 states
•Most important part of the state machine
•Routes program to proper state
Running the BP Software
To run, you need bp.exe and a run file, such as iris_bp.run
First train, then test.
For example:
To train, run: bp iris_bpr.run
You will get: bp_res.txt (weights of the trained net)
You will see (or you can redirect with >filename1): error values for each iteration
To test, run: bp iris_bps.run
You will get: bp_test.txt (summary of correct patterns)
You will see (or >filename2): detailed results
(I run: bp iris_bps.run >irisres.txt)
Sample BP Run File
0         0=train 1=test
0         if train, 0=batch 1=sequential
0.075     learning rate
0.15      momentum rate
0.01      error termination criterion (not implemented)
10000     max number of generations
99        number of training patterns
3         number of layers (3 -> one hidden layer)
4         number of PEs in hidden layer
150       total number of patterns in pattern file
4         dimension of input
3         dimension of output
iris.dat  data file
Choosing BP Network Parameters
How many hidden PEs?
Guess/estimate:
C \sqrt{n_i^2 + n_l^2},  where C \in [1,2]
(This is only a “rule of thumb.”)
Choosing BP Network Parameters
•Too few hidden PEs, and network won’t generalize or won’t train
•Too many hidden PEs, and the net will “memorize”
•Assign one output PE per class
•Probably best to start with low values for η and α
•Avoid getting stuck on an error value that’s too high, maybe .06 or
.08 SSE/pattern/PE
•I often try values of η between 0.02 and 0.20, and α = [0.01, 0.10]
The Kohonen Network
Implementations
Learning vector quantization (LVQ) software
implementation is presented first.
The self-organizing feature map (SOFM) is presented
next.
LVQ Software
General definitions (in BP section) are still valid.
New data types are defined in enumeration and structure data type
code.
Enumeration types: Network can be trained randomly or sequentially,
and can use (or not use) a conscience (described later).
Structure types: Establish PE type, define environment parameters
such as training parameters, flag for conscience, and the number of
clusters, which is the number of output PEs.
LVQ Software, Cont’d.
main() routine
LVQ_Start_Up() reads parameters from the input (run) file and
allocates memory
LVQ_Main_Loop() is the primary part of the implementation
LVQ_Clean_Up() stores results in the output file and de-allocates
memory
The LVQ implementation has 13 states.
LVQ State Diagram for Training Mode
LVQ Software, Cont’d.
Output PEs are linear.
Weights (from all inputs to an output) are normalized.
Euclidean distance calculated between input vector and each weight
vector.
The output PE with the smallest distance between input and weight
vectors is selected as winner.
Weight vector of winning PE is updated, then the learning rate is
updated.
If conscience is used, the conscience factor is updated.
LVQ Run File
0         0=train, 1=test
0         0=random pattern selection, 1=sequential
0.3       initial learning rate
0.999     learning rate shrinking factor \delta: \eta(t+1) = \delta \eta(t)
10        bias factor (gamma)
0.0001    beta
0.001     training termination criterion
500       max number of iterations
99        number of training patterns
1         1=conscience
6         max number of clusters
150       total number of patterns
4         input dimension
3         output dimension
iris.dat  data file
LVQ Results File Example
0.789628 0.573990 0.213485 0.038044   weights to first output PE (first cluster)
0.696514 0.335583 0.592744 0.225625
0.727000 0.299744 0.589254 0.185483
0.808415 0.529362 0.254345 0.039350
0.207525 0.075463 0.130591 0.966532
0.760180 0.348239 0.524717 0.159773   weights to sixth output PE (sixth cluster)
LVQ Test File Example
Cluster   Class 0   Class 1   Class 2
-------------------------------------
0            0         0        26
1            0        25         0
2            0        22         6
3           29         0         0
4           21         0         0
5            0         3        18
Class 0: clusters 3 and 4
Class 1: clusters 1 and 2
Class 2: clusters 0 and 5
141 out of 150 clustered “correctly”
Self Organizing Feature Maps
An extension of LVQ; they use LVQ features such as the conscience
Also developed by Teuvo Kohonen
Utilize slabs of PEs
Incorporate the concept of a neighborhood
Primary features of input cause corresponding local responses
in the output PE field.
Are non-linear mappings of input space onto the output PE
space (field).
SOFM Slab of PEs
•PEs in a slab have similar attributes.
•The slab has a fixed topology.
•Most slabs are two-dimensional.
Hexagonal Slab of PEs
SOFM Network Model
More likely to use raw data as input to the SOFM.
Kohonen often initializes weight vectors to be between 0.4 and 0.6 in length.
The winning output PE has the minimum Euclidean distance
between the input and weight vectors:
d_{jk} = \sqrt{\sum_{i=1}^{n} (a_{ki} - w_{ji})^2}
(Can use conscience)
SOFM Weight Updating
Weight updates are made to the winning PE and its neighborhood.
The learning coefficient and the neighborhood both shrink over time.
w_{ji}(t+1) = w_{ji}(t) + n(t)\,\eta(t)(a_{ki} - w_{ji}(t))
where n(t) is the neighborhood function. Sometimes,
\eta(t) = 0.2(1 - t/z)
where z = total number of iterations, and t is the iteration index.
SOFM Neighborhood Types
Hats
Sombrero
Stovepipe hat
Chef’s hat
SOFM Phases of Learning
Two phases of learning in the Kohonen SOFM:
1. Topological ordering, where the weight vectors order themselves.
2. Convergence, in which fine tuning occurs.
SOFM Hints
Rule of thumb: No. of training iterations should be about 500 times the
number of output PEs.
Some people start out with eta near 1.0.
The initial neighborhood should include most or all of the output PE slab.
Options exist for the configuration of the output slab: ring, cylindrical
surface, cube, etc.
SOFM Error Measurement
Unsupervised, so no right or wrong
Two approaches – pick or mix
• Define error as mean error vector length
• Define error as max error vector length (adding PE
when this is large could improve performance)
Convergence metric could be:
Max_error_vector_length/eta
(best when epoch training is used)
SOFM Advantages
•Can do real-time non-parametric pattern classification
•Don’t need to know classes a priori
•Does nearest neighbor-like classifications
•Relatively simple paradigm
•Can deal with many classes
•Can handle high-dimensionality inputs
SOFM Disadvantages
•Long training time
•Can’t add new classes without retraining
•Hard to figure out how to implement
•Not good with parameterized data
•Must normalize input patterns (?)
SOFM Applications
•Speech processing
•Image processing
•Data compression
•Combinatorial optimization
•Robot control
•Sensory mapping
•Preprocessing
SOFM Run File
0         training/recall: 0 = train; 1 = recall
0         training mode if training, 0 = random
0.3       learning rate
0.999     shrinking coefficient
10        bias factor
0.0001    beta
0.001     training error criterion for termination
500       maximum number of generations
99        number of patterns used for training
1         1 = conscience; 0 = no conscience
1         initial width of neighborhood
1         initial height of neighborhood
4         output slab height
4         output slab width
0         neighborhood function type (0 = chef hat)
150       total number of patterns
4         input dimension
3         output dimension
iris.dat  data file for patterns
SOFM Weights File
0.762695 0.409230 0.477594 0.150768   weights from inputs to first output PE
0.744240 0.379303 0.521246 0.174752
0.776556 0.443671 0.428612 0.128095
0.757758 0.397594 0.492467 0.158740
0.778668 0.421259 0.446406 0.130147
0.758743 0.376317 0.507493 0.158574
0.765185 0.391521 0.488735 0.149472
0.748811 0.363523 0.527234 0.170756
0.769893 0.418876 0.460760 0.139670
0.731007 0.357475 0.549875 0.188358
0.784809 0.461558 0.398094 0.112071
0.745425 0.374062 0.523810 0.173326
0.785969 0.437032 0.421340 0.117167
0.752147 0.362909 0.524794 0.164813
0.771124 0.401549 0.473482 0.141214
0.736854 0.345378 0.551698 0.182727   weights from inputs to last output PE

Output slab layout (first PE at upper left, last PE at lower right):
O O O O
O O O O
O O O O
O O O O
SOFM Test Results
PE       Class 0   Class 1   Class 2
-------------------------------------
00 00       0         0         0
00 01       0         0         0
00 02       0         0         0
00 03       0         1         0
01 00       0         0         0
01 01      50         0         0
01 02       0         0         0
01 03       0         1         0
02 00       0         3         0
02 01       0         1         0
02 02       0         4         0
02 03       0         1         0
03 00       0         7        25
03 01       0         3         0
03 02       0        14         0
03 03       0        15        25
Also output is the cluster assignment for each pattern.
Attributes Needed to Specify a Kohonen
SOFM
•Number and configuration of input PEs
•Number and configuration of output PEs
•Dimensionality of output slab (1, 2, 3, etc.)
•Geometry of output slab (square or hexagonal neighborhood,
wraparound or not)
•Neighborhood definition as a function of time
•Learning coefficient as a function of time and space
•Initialization of weights
•Preprocessing (normalization) and presentation (random or not) of
inputs
•Method to select winner (Euclidean distance or dot product)
Summary of SOFM Process
Allocate storage
Read weights and patterns
Loop through iterations
    Loop through patterns
        Compute activations
        Find winning PE
        Adapt weights of winner and its neighborhood
    Shrink neighborhood size
    Reduce learning coefficient eta
    If eta <= 0, break
Write final weights
Write activation values
Free storage
Evolutionary Back-Propagation
Implementation
•A merger of the back-propagation implementation and the
PSO implementation
•PSO is used only to evolve weights (not slopes of sigmoid
functions)
•BP is used only in recall mode; the outputs are used to
evaluate fitness for each particle (candidate set of weights)
Evolutionary BP, Cont’d.
•Both BP and PSO start-up and clean-up routines are included
•Length of individual particles is calculated from dimensions in
input file
•Particle elements correspond to individual weights
•BP recall is run for each particle after each iteration of PSO to
evaluate fitness (error)
•The BP network is the “problem” for PSO to solve
Main Routine for Evolutionary Back-Prop
#include <stdio.h>
#include <stdlib.h>

static void main_start_up (char *psoDataFile, char *bpDataFile);
static void main_clean_up (void);

int main (int argc, char *argv[])
{
    // check command line
    if (argc != 3)
    {
        printf("Usage: exe_file pso_run_file bp_run_file\n");
        exit(1);
    }
    // initialize
    main_start_up(argv[1], argv[2]);
    // evolve the BP weights with PSO
    PSO_Main_Loop();
    main_clean_up();
    return 0;
}

static void main_start_up (char *psoDataFile, char *bpDataFile)
{
    BP_Start_Up(bpDataFile);
    PSO_Start_Up(psoDataFile);
}

static void main_clean_up (void)
{
    PSO_Clean_Up();
    BP_Clean_Up();
}
Running the Evolutionary BP Network
Implementation
•Need the executable file pso_nn.exe
•Need two run files, such as pso.run and bp.run
•PSO run file same as for single PSO, except that length of particle not
specified
•BP run file is short; only information for recall needed
Example bp.run:
3         # of layers
4         # of hidden PEs
150       # of patterns
4         # of inputs
3         # of outputs
iris.dat  # data file
PSO Run File
1       // num of psos
0       // pso_update_pbest_each_cycle_flag
1       // total cycles of running PSOs
1       // optimization type: min or max – max. no. correct
17      // evaluation function – 17 calls BP weights from PSO
1       // inertia weight update method
0       // initialization type: sym/asym
-10.0   // left initialization range
10.0    // right initialization range
5       // maximum velocity
10      // maximum position
200     // max number of generations
30      // population size
0.9     // initial inertia weight
0       // boundary flag
        // boundaries if boundary flag is 1
BP_RES.TXT Output File
File layout:
Weights from inputs to first hidden PE (bias first)
…
Weights from inputs to last hidden PE (bias first)
Weights from first hidden to first output PE (bias first)
…
Weights from last hidden to last output PE (bias first)

Example (4 hidden PEs, 3 output PEs; bias weight first in each row):
-2.555491 -3.560039  2.198371  8.452043 -0.000573    weights to first hidden PE
-4.703630  6.440988  8.627151 -3.195024  0.699212
-1.443098 -6.584295  0.430629  2.237892  0.960514
-5.099212 -3.314713  0.362337 -8.708467 -3.981537    weights to 4th hidden PE
-5.676066  2.128347 -1.152100  5.140296 -3.994824    weights to first output PE
 4.449585 -2.012187  0.222005 -3.648189 -1.876380
 7.973076  6.194356 -0.598305 -6.768669 -11.408623   weights to 3rd output PE
Performance Variations with Slope
Thresholds
Slope threshold s   Total number         Average number
(absolute value)    correct in 40 runs   correct per run   Variance
None                5914                 147.85            1.57
80                  5914                 147.85            1.57
40                  5911                 147.78            1.77
20                  5904                 147.60            1.94
10                  5894                 147.35            2.08
4                   5814                 145.35            62.75
For each threshold value, 40 runs of 1000 generations
were made of the 150-pattern data set.