Artificial Neural Networks
An ANN is an abstract computational model of the human brain.
The human brain has the ability to learn and generalize.
Learning: connecting neurons into paths, i.e. neurons “fire” other neurons (using the process of
chemical change), and “remembering” those paths.
Recall: activating a collection of neurons in time.
Generalization: similar stimuli recall similar patterns of activity.
Associative memory: ability to store patterns and recall those patterns when only parts of them
are given as inputs.
ANN definition, p.196: “An ANN is a massively parallel distributed processor made up of simple
processing units. It has the ability to learn from experiential knowledge expressed through
interunit connection strengths, and can make such knowledge available for use.”
Key words:
- parallel distributed structure;
- ability to learn and generalize (leading to fault tolerance, learning from training data, and recall).
ANNs are used for:
- Pattern recognition
- Pattern association
- Function approximation
- Control
- Filtering
- Smoothing
- Prediction
An ANN is defined by four parameters:
1. Type of neurons
2. Connectionist architecture (i.e. the organization of connections between neurons)
3. Learning algorithm
4. Recall algorithm
All nodes are adaptive (i.e. can learn).
An ANN can be viewed as a graph:
- Each node is a neuron (i.e. a processing unit)
- Edges are causal relationships between nodes.
1. Type of neurons (i.e. nodes)
An artificial neuron k is defined by four parameters:
1. Input connections (i.e. inputs)
X=<x1 x2 … xm>
and their weights
Wk = < wk1 wk2 … wkm>.
m is the number of input signals. (In KDD, in the input stage of the network, m is the number
of features.)
An input to a neuron can also be the externally applied bias function bk.
2. Input function net = g(X,W). Usually, function g is the linear combiner, i.e.
netk = SUM[i=1,m] (xi * wki ), where m is the number of inputs to the neuron.
It is assumed that bk = wk0 and x0 = 1, so that netk is the scalar product of two vectors:
netk = X * W.
3. An activation (signal) function a = f(net). It calculates the activation level of the neuron.
4. An output function y. It calculates the output signal emitted through the output of the
neuron. Usually, y = a.
The picture of neuron k is shown below.
[Figure: inputs x1, x2, …, xm and the bias bk feed the summation ∑, producing net, which passes through the activation function f(net) to give the output yk.]
Example on p.199:
Assume X = <0.5 0.5 0.2>, W = <0.3 0.2 0.5>, b = -0.2, one neuron.
net = 0.5*0.3 + 0.5*0.2 + 0.2*0.5 + (-0.2) = 0.15
a. symmetrical hard limit: y = 1
b. saturating linear: y = net = 0.15
c. log sigmoid: y = 1/(1+e^(-0.15)) = 0.54
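As a quick check, here is a minimal Python sketch of this single-neuron computation (the function names are just local labels, not a library API):

```python
import math

X = [0.5, 0.5, 0.2]   # inputs
W = [0.3, 0.2, 0.5]   # weights
b = -0.2              # bias

net = sum(x * w for x, w in zip(X, W)) + b    # linear combiner: 0.15

hardlims = 1 if net >= 0 else -1              # symmetrical hard limit -> 1
satlin = max(-1.0, min(1.0, net))             # saturating linear -> 0.15
logsig = 1 / (1 + math.exp(-net))             # log sigmoid -> ~0.54

print(net, hardlims, satlin, logsig)
```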
2. Type of connections (i.e. connectionist architecture)
Connectionist architecture (i.e. NN topology) is the organization of connections between
neurons.
Neurons can be:
- fully connected (every neuron is connected to every other neuron)
- partially connected (usually, only connections between neurons of different layers are allowed).
Neurons are arranged in layers. ANNs can be:
- autoassociative: the input neurons are also the output neurons
- heteroassociative: there are separate sets of input and output neurons.
Based on the presence of loops, the ANN architecture can be:
- feedforward: no connections from the output back to the input
- feedback (i.e. recurrent): there are connections from the output to the input neurons.
Example p. 199 cntd.:
Connect the nodes from p.199 into a feedforward network with symmetric saturating linear activation and bias = 0:
[Figure: inputs x1 = 1 and x2 = 0.5 feed neurons 1 and 2; their outputs y1 and y2 feed neuron 3, which produces y3.]
X = < x1 x2 > = < 1 0.5 >
W1 = < w11 w12 > = < 0.2 0.5 >
W2 = < w21 w22 > = < -0.6 -1 >
W3 = < w31 w32 > = < 1 -0.5 >
net1 = 1*0.2 + 0.5*0.5 = 0.45       =>  y1 = net1 = 0.45
net2 = 1*(-0.6) + 0.5*(-1) = -1.1   =>  y2 = -1 (saturated at -1)
net3 = 0.45*1 + (-1)*(-0.5) = 0.95  =>  y3 = net3 = 0.95
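A short Python sketch of this forward pass, under the same symmetric saturating linear activation:

```python
def satlins(net):
    return max(-1.0, min(1.0, net))   # clamp to [-1, 1]

x = [1.0, 0.5]                        # inputs x1, x2
W1, W2, W3 = [0.2, 0.5], [-0.6, -1.0], [1.0, -0.5]

net1 = x[0] * W1[0] + x[1] * W1[1]    # 0.45
net2 = x[0] * W2[0] + x[1] * W2[1]    # -1.1
y1, y2 = satlins(net1), satlins(net2) # 0.45, -1 (saturated)
net3 = y1 * W3[0] + y2 * W3[1]        # 0.45 + 0.5 = 0.95
y3 = satlins(net3)                    # 0.95
```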
Why multilayer: some problems cannot be described with a single layer.
Example p.201: XOR problem
Desired: make an ANN which will produce Y = X1 xor X2 on inputs X1 and X2.
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0
Problem: there is no single straight line that can “cut” the X1-X2 space into the two required regions (XOR is not linearly separable). Therefore, a single-layer neural net cannot be used.
Solution: use a multilayer network, for example the one below with hard-limit activation:
[Figure: inputs x1, x2 feed hidden neurons 1 and 2; their outputs n1, n2 feed output neuron 3, which produces y.]
X = < x1 x2 >
W1 = < w11 w12 > = < 2 2 >, bias b1 = -1
W2 = < w21 w22 > = < -1 -1 >, bias b2 = 1.5
W3 = < w31 w32 > = < 1 1 >, bias b3 = -1.5
Check that it works:
x1  x2  net1  net2  n1  n2  net3  y
0   0   -1    1.5   0   1   -0.5  0
0   1    1    0.5   1   1    0.5  1
1   0    1    0.5   1   1    0.5  1
1   1    3   -0.5   1   0   -0.5  0
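The table can be verified with a few lines of Python (weights and biases taken from the network above):

```python
def hardlim(net):
    return 1 if net >= 0 else 0

def xor_net(x1, x2):
    n1 = hardlim(2*x1 + 2*x2 - 1)        # hidden neuron 1: W1 = <2 2>, b1 = -1
    n2 = hardlim(-1*x1 - 1*x2 + 1.5)     # hidden neuron 2: W2 = <-1 -1>, b2 = 1.5
    return hardlim(1*n1 + 1*n2 - 1.5)    # output neuron 3: W3 = <1 1>, b3 = -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # prints the XOR truth table
```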
3. Learning Algorithm
Learning: possible modification of behavior in response to the environment.
Intelligence = quantity/speed of learning.
In a NN, learning is a process (i.e. a learning algorithm) by which the parameters of the ANN are
adapted. Learning occurs when a training example causes a change in at least one synaptic weight.
Learning can be seen as a “curve fitting” problem. As the NN learns and the weights keep changing,
the network converges to a mapping between inputs and outputs. Learning stops when applying the
training samples to the NN no longer causes changes in the synaptic weights: either the network
has learned the training samples, or the network is saturated. See the end of this section for a
discussion of when to stop.
NN learning can be:
- Supervised (e.g. classification, function approximation)
- Unsupervised (e.g. clustering, feature summarization)
- Reinforcement learning (reward-penalty learning, error-corrective learning): present the input to the ANN, calculate the output, then adjust the weights to reduce the error. Repeat until the error is sufficiently small. The weights are usually adjusted backwards (backpropagation learning).
Algorithm:
1. Set an appropriate structure for the NN, for example:
a. (n+1) input neurons (1 for the bias and n for the n input variables)
b. m output neurons
c. set the initial weights
2. Supply an input vector x from the set of training samples.
3. Calculate the output vector o produced by the NN.
4. Compare the desired output y with the output o produced by the NN. If possible, evaluate the error.
5. Correct the connection weights in such a way that the next time x is presented to the network, the produced output is closer to the desired output.
6. If necessary, repeat steps 2-5 until the network reaches a convergence state (a skeleton of this loop is sketched below).
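A skeleton of this loop in Python; `net.forward` and `net.update_weights` are hypothetical placeholders for the recall step and for whichever learning law is used (e.g. the delta rule below):

```python
def train(net, samples, max_epochs=100, tolerance=1e-3):
    for epoch in range(max_epochs):
        total_error = 0.0
        for x, y_desired in samples:       # step 2: supply an input vector
            o = net.forward(x)             # step 3: calculate the output
            error = y_desired - o          # step 4: evaluate the error
            net.update_weights(x, error)   # step 5: correct the weights
            total_error += error * error
        if total_error < tolerance:        # step 6: repeat until convergence
            break
```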
How to evaluate the error? For example:
Instantaneous error: Error = (o – y) or |o – y|
Mean-square error: Error = (o – y)^2 / 2
Etc.
The large majority of ANNs used today are multilayer networks with a backpropagation learning mechanism.
[Figure: inputs x1, x2, …, xm feed the ANN, which produces the real output yk(n); it is compared against the desired output dk(n).]
Real output: yk (n) = SUM[i=1,m] (xi*wki )
Desired output: dk (n)
n: the number of the sample we are considering, i.e. the number of the current iteration.
Error: ek (n) = dk (n) - yk (n)
Total cost: E(n) = 0.5 * ek^2 (n). Minimize it. Update the weights using this result.
There are many algorithms that calculate the updated weights (i.e. learning-law algorithms). For
example:
Hebbian learning law
If two neurons i and j are repeatedly and simultaneously activated by input stimuli, the synapse
between them increases its strength wij. The change in weight is:
Δwij = c * ai * aj
where ai and aj are the activation values and c is a positive learning-rate constant.
http://uhaweb.hartford.edu/compsci/neural-networks-hebb-rule.html
Learning with forgetting:
Introduce a decay coefficient d. There are many algorithms with forgetting, for example:
Δw’ij (t) = Δwij (t) – d*sign(wij (t))
Hebbian learning with forgetting:
Δwij (t+1) = c * ai * aj – d * wij (t)
To increase the robustness and performance of a network, small random signals can be introduced as noise.
Hebbian learning law with noise:
Δwij = c * ai * aj + noise
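The three Hebbian variants side by side, as a Python sketch; the constants c, d, and the noise scale are illustrative values, not prescribed by the text:

```python
import random

c, d = 0.1, 0.01   # illustrative learning-rate and decay coefficients

def hebb(w, a_i, a_j):
    return w + c * a_i * a_j                 # plain Hebbian law

def hebb_forgetting(w, a_i, a_j):
    return w + c * a_i * a_j - d * w         # decay pulls the weight toward 0

def hebb_noise(w, a_i, a_j, noise_scale=0.001):
    return w + c * a_i * a_j + random.gauss(0, noise_scale)
```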
Error backpropagation algorithm
Used in multilayer perceptrons.
Error-corrective learning using the delta rule (also called the Least Mean Square (LMS) method
or the Widrow-Hoff rule):
Δwkj (n) = η * ek (n) * xj (n), where η is a positive constant.
η determines the rate of learning and must be sufficiently small to produce good learning (but
too small a value makes learning too slow); picking a large η speeds up learning but can cause
instability.
Updated weights are:
wkj (n+1) = wkj (n) + Δwkj (n)
Assume there are N samples.
http://uhaweb.hartford.edu/compsci/neural-networks-delta-rule.html
Example p.204:
Assume η = 0.1, linear activation (y = net), one neuron, bias = 0, X = <x1 x2 x3>,
W = <0.5 -0.3 0.8>, and:
Sample number (n)   x1    x2    x3     d
1                    1     1     0.5   0.7
2                   -1     0.7  -0.5   0.2
3                    0.3   0.3  -0.3   0.5
Sample 1: net = 1*0.5 + 1*(-0.3) + 0.5*0.8 = 0.6
y(1) = 0.6,
e(1) = d(1)-y(1) = 0.7-0.6=0.1
Δw1 (1) = η * e (1) * x1 (1) = 0.1*0.1*1 = 0.01 => w1 (2) = 0.5+0.01 = 0.51
Δw2 (1) = η * e (1) * x2 (1) = 0.1*0.1*1 = 0.01 => w2 (2) = -0.3+0.01 = -0.29
Δw3 (1) = η * e (1) * x3 (1) = 0.1*0.1*0.5 = 0.005 => w3 (2) = 0.8+0.005 = 0.805
Repeat the whole process using the new weights (i.e. with W = <0.51 -0.29 0.805>):
Sample 2: net = -1*0.51 + 0.7*(-0.29) + (-0.5)*0.805 ≈ -1.1,
y (2) = -1.1, d (2) = 0.2,
e (2) = d (2) - y (2) = 1.3
Δw1 (2) = η * e (2) * x1 (2) = 0.1*1.3*(-1) = -0.13 => w1 (3) = 0.51-0.13 = …
Δw2 (2) = η * e (2) * x2 (2) = 0.1*1.3*0.7 = 0.091 => w2 (3) = -0.29 + 0.091
Δw3 (2) = η * e (2) * x3 (2) = … => w3 (3) = …
Repeat the whole process using the new weights (i.e. with W = < w1(3) w2(3) w3(3) >):
Sample 3: …. Will produce W = < w1(4) w2(4) w3(4) >
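The whole example can be reproduced with a short Python sketch (values taken directly from the example; tiny discrepancies with the hand calculation come from rounding net(2) to -1.1):

```python
eta = 0.1
W = [0.5, -0.3, 0.8]
samples = [([1, 1, 0.5], 0.7),
           ([-1, 0.7, -0.5], 0.2),
           ([0.3, 0.3, -0.3], 0.5)]

for n, (x, d) in enumerate(samples, start=1):
    net = sum(xi * wi for xi, wi in zip(x, W))       # y = net (linear activation)
    e = d - net                                      # e(n) = d(n) - y(n)
    W = [wi + eta * e * xi for wi, xi in zip(W, x)]  # w(n+1) = w(n) + eta*e*x(n)
    print(f"sample {n}: net = {net:.4f}, e = {e:.4f}, W = {[round(w, 4) for w in W]}")
```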
When do we stop readjusting the weights? When the network has reached convergence but is not
overfitted. How do we determine that the network has reached convergence and is not overfitted?
Grossberg's saturation theorem states that large input signals saturate a neuron when it is sensitive
to small input signals; if the neuron is sensitive to large input signals, small signals are ignored
as noise.
For example: the NN has reached convergence when the weight changes stay within the magnitude of the
noise:
|Δwij| <= noise
or when the weights oscillate between two or more states.
If a NN learns from too many input-output training samples, it can memorize the data. This process
is called overfitting or overtraining. Solution:
- Use the simplest net possible:
o Small number of layers
o Small number of parameters (much less than the number of data points)
- Stop training before the net overfits.
4. Recall algorithm
Mimics the associative memory and the generalization process of the human brain: when fed pieces
of patterns, it recalls similar patterns.
In the simplest case, the recall algorithm just calculates the output of the NN for a given input sample.
A network is said to generalize well when the input-output mapping is correct for test data (data
not used for training).
Supervised Learning Example: Perceptrons, Multilayer Perceptrons (MPNN)
and Error Backpropagation Algorithm
A perceptron is a feedforward NN consisting of a single layer of neurons sandwiched between the
input and output layers. The input layer is just a buffer for storing input data. The output layer is
often just an “outlet.” Thus, a perceptron NN officially has three layers, but only one or two of
those layers contain neurons.
Perceptrons can recognize only linearly separable classes, i.e. clusters that can be separated by
drawing a straight line between them; the XOR problem we saw earlier is a counterexample, which is
why it needs more than a single layer.
Solution: use multilayer perceptrons. A multilayer perceptron consists of:
1. An input layer (consisting of buffers for storing data).
2. Several hidden layers consisting of neurons. Each layer is fully connected to the next
layer (i.e. each neuron in layer j is connected to all neurons in layer j+1).
3. An output layer consisting of neurons producing the output signals (i.e. the output response).
Only the input and output layers are visible to the user of an MPNN.
[Figure: multilayer perceptron with inputs x1, x2, …, xn, one or more hidden layers, and outputs y1, …, ym.]
Error Backpropagation Algorithm
Has two phases:
1. Forward pass:
a. calculate the outputs in a feedforward manner;
b. calculate the error at the output layer.
2. Backward pass:
a. using the error at the output layer, adjust the weights of the inputs to the output layer;
b. go backwards all the way to the input layer: calculate the error of the previous layer
and adjust its weights.
Repeat steps 1-2 until the network reaches convergence, e.g. until the average error energy is
sufficiently small.
The beginning of the algorithm is the same as for most learning algorithms, as discussed in the
section on learning.
The real output of neuron j at iteration n is yj (n), which is a function applied to vj (n):
vj (n) = SUM[i=1,m] (xi*wji ) // we called vj (n) “net” before
yj (n) = φ(vj (n))
m is the number of input signals to neuron j, i.e. the number of neurons in the previous layer in
the case of an MPNN.
The desired output of neuron j at iteration n (each iteration is the application of a new training
sample, so iteration n is obtained when sample n is applied to the network):
dj (n)
Error for neuron j at iteration n:
ej (n) = dj (n) – yj (n)
Instantaneous value of the error energy for neuron j at iteration n:
Ij(n) = 0.5 * ej^2 (n)
Total error energy for the entire network:
E(n) = 0.5 * SUM[for all j in the output layer] ej^2 (n)
Average error energy for the entire network:
Eav = (1/N) * SUM[n=1,N] E(n)
where N is the total number of input samples.
Minimize E(n). Update the weights using this result.
At this point, the specifics of the error backpropagation algorithm begin.
E(n) is minimized in the true mathematical sense: by taking its first derivative with respect to the
weights wji and moving against the gradient. After some calculations (see p.210) we get:
∂E(n)/ ∂wji = -ej(n) * ( ∂yj (n)/ ∂vj (n) )* xi(n)
Let’s label:
φ'(vj (n)) = ∂yj (n)/ ∂vj (n)
δj (n) = ej(n) * φ'(vj (n))
//we will call it the local gradient
Therefore:
∂E(n)/ ∂wji = -ej(n) * φ'(vj (n)) * xi(n)
Δwji (n) = - η * ∂E(n)/ ∂wji
//using the delta rule (i.e. Widrow-Hoff rule)
//specified in the learning algorithm section
So η is the learning rate.
Δwji (n) = η * ej(n) * φ'(vj (n)) * xi(n)
= η * δj (n) * xi(n)
Backpropagation learning algorithm backward step:
φ'(vj (n)) = ∂yj (n)/ ∂vj (n)
If neuron j is in the output layer,
δj (n) = ej(n) * φ'(vj (n))
If neuron j is in a hidden layer,
δj (n) = φ'(vj (n)) * SUM [k = 1, Nj+1] ( δk (n) * wkj(n))
Δwji (n) = η * δj (n) * xi(n)
where Nj+1 is the total number of neurons in the next layer, i.e. layer j+1, which are connected to
the neuron j.
In order to avoid instability, calculate Δwji(n) with a momentum term:
Δwji (n) = η * δj (n) * xi(n) + α * Δwji (n-1)
α > 0 and is called momentum constant, usually 0.1 ≤ α ≤ 1.
Heuristics:
- η = 0.1
- α = 0.5
- initial weights are very small, uniformly distributed random numbers
- start with a small number of hidden nodes and layers, analyze the results, and keep increasing the number of hidden nodes until the error no longer improves.
See discussion at the end of the learning algorithm section.
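Putting the forward pass, the local gradients, and the momentum update together, here is a compact Python sketch of backpropagation for one hidden layer, assuming a log-sigmoid activation (so φ'(v) = y*(1-y)); the layer sizes and the XOR training data are illustrative choices, not part of the derivation above:

```python
import math
import random

random.seed(0)
eta, alpha = 0.1, 0.5            # learning rate and momentum (heuristic values above)
n_in, n_hid, n_out = 2, 2, 1     # illustrative layer sizes (enough for XOR)

# very small uniformly distributed initial weights; the last column holds the bias
W1 = [[random.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hid)]
W2 = [[random.uniform(-0.1, 0.1) for _ in range(n_hid + 1)] for _ in range(n_out)]
dW1 = [[0.0] * (n_in + 1) for _ in range(n_hid)]    # previous updates, for momentum
dW2 = [[0.0] * (n_hid + 1) for _ in range(n_out)]

def phi(v):                      # log-sigmoid activation, so phi'(v) = y * (1 - y)
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, d):
    # forward pass: compute hidden and output activations (bias input = 1)
    xb = list(x) + [1.0]
    hid = [phi(sum(w * xi for w, xi in zip(row, xb))) for row in W1]
    hb = hid + [1.0]
    out = [phi(sum(w * hi for w, hi in zip(row, hb))) for row in W2]
    # backward pass, output layer: delta_j = e_j * phi'(v_j)
    d_out = [(d[k] - out[k]) * out[k] * (1 - out[k]) for k in range(n_out)]
    # hidden layer: delta_j = phi'(v_j) * SUM_k delta_k * w_kj
    d_hid = [hid[j] * (1 - hid[j]) * sum(d_out[k] * W2[k][j] for k in range(n_out))
             for j in range(n_hid)]
    # weight updates with momentum: dw(n) = eta*delta*x + alpha*dw(n-1)
    for k in range(n_out):
        for j in range(n_hid + 1):
            dW2[k][j] = eta * d_out[k] * hb[j] + alpha * dW2[k][j]
            W2[k][j] += dW2[k][j]
    for j in range(n_hid):
        for i in range(n_in + 1):
            dW1[j][i] = eta * d_hid[j] * xb[i] + alpha * dW1[j][i]
            W1[j][i] += dW1[j][i]

# e.g. repeated presentation of the XOR samples:
for epoch in range(20000):
    for x, d in [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]:
        train_step(x, d)
```

With the heuristics above (η = 0.1, α = 0.5, small random initial weights) such a net usually learns XOR within a few thousand epochs, though convergence is sensitive to the initial weights.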
Competitive Networks and Competitive Learning
Multilayer perceptron: several output neurons can fire simultaneously.
Competitive networks: only one output neuron is active at a time (“winner takes all”):
1. output neurons have limits on the strength of their output
2. there is a competition mechanism
Simplest case: one layer (with lateral inhibition).
X = <x1 x2 … xn>
Y = <y1 y2 … ym>
Weight wkj connects input xj to output neuron yk.
Lateral inhibitory weight hkj connects node j to node k.
[Figure: single-layer competitive network; inputs x1, …, xn connect through weights w11 … wmn to output neurons 1, …, m, which compute net1, …, netm and emit y1, …, ym, with lateral inhibitory connections among the output neurons.]
Algorithm:
1. Forward phase: calculate the nets:
netk = SUM[for all inputs j that connect to node k] (xj*wkj)
Include the lateral inhibitory connections (the other nodes' outputs act as additional, inhibitory inputs):
netk = netk + SUM[for all nodes j that inhibit node k] (yj*hkj)
2. Competition phase:
if netk > netj for all j ≠ k
yk = 1 //neuron k won the competition
else
yk = 0
If neuron k wins the competition, change its weights:
Δwkj = η*(xj – wkj)
else
Δwkj = 0
Usage: each output neuron represents a cluster.
Example p.217:
X = <1 0 1>
W1 = <w11 w12 w13> = <0.5 0 -0.5>
W2 = <w21 w22 w23> = <0.3 0.7 0>
W3 = <w31 w32 w33> = <0 0.2 -0.2>
C1 = <c11 c12 c13> = <0 0.5 0.6>
C2 = <c21 c22 c23> = <0.2 0 0.1>
C3 = <c31 c32 c33> = <0.4 0.2 0>
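A minimal Python sketch of one forward/competition/update step with these numbers; it omits the lateral inhibition term (the C matrices) for simplicity, and the learning rate η = 0.2 is an illustrative choice:

```python
eta = 0.2   # illustrative learning rate (not given in the text)

X = [1, 0, 1]
W = [[0.5, 0.0, -0.5],   # W1
     [0.3, 0.7, 0.0],    # W2
     [0.0, 0.2, -0.2]]   # W3

# forward phase: net_k = SUM_j x_j * w_kj
nets = [sum(x * w for x, w in zip(X, row)) for row in W]   # [0.0, 0.3, -0.2]

# competition phase: winner takes all
winner = max(range(len(nets)), key=lambda k: nets[k])      # neuron 2 (index 1)
Y = [1 if k == winner else 0 for k in range(len(nets))]    # [0, 1, 0]

# only the winner's weights move toward the input: w_kj += eta * (x_j - w_kj)
W[winner] = [w + eta * (x - w) for w, x in zip(W[winner], X)]
print(nets, Y, W[winner])   # winner's weights become [0.44, 0.56, 0.2]
```

After this step the winner's weight vector has moved toward X, which is how each output neuron comes to represent a cluster.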