Neural Network
Fall 2013
Comp3710 Artificial Intelligence
Computing Science
Thompson Rivers University
Course Outline

- Part I – Introduction to Artificial Intelligence
- Part II – Classical Artificial Intelligence
- Part III – Machine Learning
  - Introduction to Machine Learning
  - Neural Networks
  - Probabilistic Reasoning and Bayesian Belief Networks
  - Artificial Life: Learning through Emergent Behavior
- Part IV – Advanced Topics
A new sort of computer

What are (everyday) computer systems good at... and not so good at?

Good at:
- Rule-based systems: doing what the programmer wants them to do

Not so good at:
- Dealing with noisy data
- Dealing with unknown environment data
- Massive parallelism
- Fault tolerance
- Adapting to circumstances

How about humans?
Reference

- Artificial Intelligence Illuminated, Ben Coppin, Jones and Bartlett Illuminated Series
- Many tutorials from the Internet
  - Neural Networks with Java: http://fbim.fh-regensburg.de/~saj39122/jfroehl/diplom/e-main.html
  - Joone – Java Object Oriented Neural Engine: http://www.jooneworld.com/
  - http://www.ai-junkie.com/ai-junkie.html
  - http://www.willamette.edu/~gorr/classes/cs449/intro.html
  - http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
  - ...
Learning Outcomes

- Train a perceptron with a training data set.
- Update the weights in a feed-forward network with error backpropagation.
- Update the weights in a feed-forward network with error backpropagation and the delta rule.
- Implement a feed-forward network for a simple pattern recognition application.
- List two examples of recurrent networks.
- List applications of Hopfield networks, Bidirectional Associative Memories, and Kohonen Maps.
- Update the weights in a Hebbian network.
Chapter Contents

1. Biological Neurons
2. How to model biological neurons – Artificial Neurons
3. The first neural network – Perceptrons
4. How to overcome the problems in Perceptron networks – Multilayer Neural Networks
   - Feed-forward, Backpropagation, Backpropagation with Delta Rule
5. Can an ANN remember? – Recurrent Networks
   - Hopfield Networks
   - Bidirectional Associative Memories
   How to learn without using a training data set:
6. Kohonen Maps
7. Hebbian Learning
8. Fuzzy Neural Networks
9. Evolving Neural Networks
1. Biological Neurons

- The human brain is made up of about 100 billion simple processing units – neurons. -> parallelism; emergent behavior; ...

(Figure: a biological neuron – dendrites, cell body, nucleus, axon, synapse)

- Inputs are received on dendrites, and if the input levels are over a threshold, the neuron fires, passing a signal through the axon to the synapse, which then connects to another neuron.
- The human brain has a property known as plasticity, which means that neurons can change the nature and number of their connections to other neurons in response to events that occur. In this way, the brain is able to learn.
- [Q] How to model?
2. Artificial Neurons

- Artificial neurons are based on biological neurons.
- McCulloch and Pitts (1943)
- Each neuron in the network receives one or more inputs.
- An activation function is applied to the inputs, which determines the output of the neuron – the activation level.
- Weights are associated with synapses.

- Three typical activation functions:
  - Step function: y = 1 if x > t, else 0
  - Sigmoid: y = 1 / (1 + e^(-x))
  - Linear: y = ax + b

- A typical activation function, called the step function, works as follows:

  (Figure: inputs x1..xn with weights w1..wn feeding node Y; here
   w1 = 0.8, w2 = 0.4; x1 = 0.7, x2 = 0.9; t = 0.8)

  X = Σ (i = 1..2) wi xi = ???
  Y = ???

- Each previous node i has a weight, wi, associated with it. The sum of all the weights is 1. The input from previous node i is xi.
- t is the threshold.
- So if the weighted sum of the inputs to the neuron is above the threshold, then the neuron fires.
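The weighted sum above can be checked directly; a quick sketch with the slide's numbers (w1 = 0.8, w2 = 0.4, x1 = 0.7, x2 = 0.9, t = 0.8):

```python
# Step-activation check with the slide's example values.
w = [0.8, 0.4]
x = [0.7, 0.9]
t = 0.8

X = round(sum(wi * xi for wi, xi in zip(w, x)), 2)  # weighted sum of the inputs
Y = 1 if X > t else 0                               # step activation: fire if X > t
print(X, Y)  # 0.92 1 -> the weighted sum exceeds the threshold, so the neuron fires
```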
- There is no central processing or control mechanism.
- The entire network is involved in every piece of computation that takes place.
- The processing time in each artificial neuron is small, unlike in a real biological neuron.
- Parallelism of a massive number of neurons
- The weight associated with each connection (equivalent to a synapse in the biological brain) can be changed in response to particular sets of inputs and events. In this way, an artificial neural network is able to learn.
- [Q] How and when to change the weights?
3. Perceptrons

- A perceptron is a single neuron that classifies a set of inputs into one of two categories (usually 1 or -1).
- Rosenblatt (1958)




- If the inputs are in the form of a grid, a perceptron can be used to recognize visual images of shapes.
- The perceptron usually uses a step function, which returns 1 if the weighted sum of inputs exceeds a threshold, and 0 otherwise:

  X = Σ (i = 1..n) wi xi
  Y = Step(X) = 1 if X > t, 0 if X <= t

  Example: w1 = 0.8; w2 = 0.4; x1 = 0.7; x2 = 0.9; t = 0.8
  X = Σ (i = 1..2) wi xi = ???
  Y = ???

- The perceptron is trained as follows:
  - First, random weights (usually between -0.5 and 0.5) are given.
  - An item of training data is presented.
  - The output classification is observed.
  - If the perceptron misclassifies it, the weights are modified according to:

    wi <- wi + (a * xi * e)

    - a is the learning rate, between 0 and 1.
    - e is the size of the error.
  - Perceptron training rule for e:
    - 0 if the output is correct
    - Positive if the output is too low
    - Negative if the output is too high
  - Training continues until all errors become zero; then the weights are not changed anymore.


- Example of logical OR for two inputs, with t = 0 and a = 0.2
- Random initial weights: w1 = -0.2, w2 = 0.4

  Update rule: wi <- wi + (a * xi * e)

- The first training data: x1 = 0, x2 = 0
  - Expected output is 0 (= 0 OR 0).
  - Y = Step(0 * -0.2 + 0 * 0.4) = 0
  - The output is correct; the weights are not changed.

  wi <- wi + (a * xi * e)

- For the next training data: x1 = 0, x2 = 1 (w1 = -0.2, w2 = 0.4)

  Y = Step(0 * -0.2 + 1 * 0.4) = Step(0.4) = 1    Correct!

- Next: x1 = 1, x2 = 0 (w1 = -0.2, w2 = 0.4)

  Y = Step(1 * -0.2 + 0 * 0.4) = Step(-0.2) = 0   Wrong!
  e = 1 - 0 = 1
  w1 <- -0.2 + (0.2 * 1 * 1) = 0
  w2 <- 0.4 + (0.2 * 0 * 1) = 0.4

- Next: x1 = 1, x2 = 1 (w1 = 0, w2 = 0.4)

  Y = Step(1 * 0 + 1 * 0.4) = Step(0.4) = 1       Correct!
Epoch | x1 | x2 | Expected Y | Actual Y | Error |  w1  |  w2
  1   | 0  | 0  |     0      |    0     |   0   | -0.2 | 0.4
  1   | 0  | 1  |     1      |    1     |   0   | -0.2 | 0.4
  1   | 1  | 0  |     1      |    0     |   1   |  0   | 0.4
  1   | 1  | 1  |     1      |    1     |   0   |  0   | 0.4
  2   | 0  | 0  |     0      |    0     |   0   |  0   | 0.4
  2   | 0  | 1  |     1      |    1     |   0   |  0   | 0.4
  2   | 1  | 0  |     1      |    0     |   1   |  0.2 | 0.4
  2   | 1  | 1  |     1      |    1     |   0   |  0.2 | 0.4
  3   | 0  | 0  |     0      |    0     |   0   |  0.2 | 0.4
  3   | 0  | 1  |     1      |    1     |   0   |  0.2 | 0.4
  3   | 1  | 0  |     1      |    1     |   0   |  0.2 | 0.4
  3   | 1  | 1  |     1      |    1     |   0   |  0.2 | 0.4
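The training loop traced in the table above can be sketched as follows (same parameters as the slides: t = 0, a = 0.2, initial weights -0.2 and 0.4):

```python
# Perceptron training for logical OR, mirroring the slide's trace.
def step(x, t=0.0):
    return 1 if x > t else 0

def train_perceptron(data, w, a=0.2, t=0.0, max_epochs=100):
    for _ in range(max_epochs):
        errors = 0
        for xs, expected in data:
            y = step(sum(wi * xi for wi, xi in zip(w, xs)), t)
            e = expected - y                       # perceptron training rule for e
            if e != 0:
                errors += 1
                w = [wi + a * xi * e for wi, xi in zip(w, xs)]
        if errors == 0:                            # a whole epoch with no errors
            break
    return w

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(or_data, [-0.2, 0.4]))  # [0.2, 0.4], as in the table
```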

- [Q] Can you test the previous system with
  - 0.1, 1
  - -0.7, 0.3
  - 0, -0.2
- [Q] Can you develop a system for the Boolean AND operation?
- [Q] What is learning in Perceptrons?
- Perceptrons can only classify linearly separable functions (such as AND and OR).
- The first of the following graphs shows a linearly separable function (OR).
- The second is not linearly separable (Exclusive-OR).
- Demo – Perceptron Learning Applet
- [Q] How to improve?
4. Multilayer Neural Networks

- Multilayer neural networks can classify a range of functions, including non-linearly separable ones.
- A feed-forward network
  - Not cellular automata
  - Neurons work synchronously
- [Q] How to train?
- Each input-layer neuron connects to all neurons in the hidden layer (or layers).
- The neurons in the hidden layer connect to all neurons in the output layer.

- Demo – XOR, AND, OR
- XOR can be written:

  y = (x1 OR x2) AND (NOT x1 OR NOT x2)
  Therefore, y = (x1 OR x2) AND NOT (x1 AND x2)

- Split:

  y1 = x1 OR x2
  y2 = NOT (x1 AND x2)
  y = y1 AND y2
Backpropagation

- Multilayer neural networks learn in the same way as Perceptrons (wi <- wi + (a * xi * e)), but training takes too long a time. [Q] Why?
- There are many more weights, and it is important to assign credit (or blame) correctly when changing weights.
- Backpropagation networks use the sigmoid activation function, not a step function, as it is easy to differentiate.

- For node j:
  - Xj is the input, yj is the output
  - n is the number of inputs to node j
  - θj is the threshold for j

  Xj = Σ (i = 1..n) wij xi - θj

  yj = 1 / (1 + e^(-Xj))

  (Figure: xi --wij--> node j --wjk--> node k; output of j is yj)

- After values are fed forward through the network, errors are fed back to modify the weights in order to train the network.
- For each node, we calculate an error gradient.




- For a node k in the output layer, the error ek is the difference between the desired output and the actual output.
- The error gradient for k is:

  δk = yk (1 - yk) ek

- Now the weights are updated as follows:

  wjk <- wjk + η yj δk

- Similarly, for a node j in the hidden layer:

  δj = yj (1 - yj) Σk wjk δk
  wij <- wij + η xi δj

- η is the learning rate (a positive number below 1).
- Known as gradient descent.

  (Figure: xi --wij--> node j --wjk--> node k; outputs yj, yk)
- Example with η = 0.2:

  wjk = .45, wjl = .82, yj = .9, δk = .2, δl = -.7

  wjk <- .45 + .2 * .9 * .2 = .486
  wjl <- .82 + .2 * .9 * (-.7) = .694

  δj = yj (1 - yj) (wjk δk + wjl δl)
     = .9 * .1 * (.486 * .2 + .694 * (-.7))
     = -.034974

  wij <- wij + .2 * xi * (-.034974)

  (Figure: xi --wij--> node j; j --wjk--> node k and --wjl--> node l)
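The worked update above can be reproduced in a few lines (η = 0.2; the weights and δ values are the slide's):

```python
# One backpropagation step for hidden node j, reproducing the slide's numbers.
eta = 0.2
y_j = 0.9
w = {"jk": 0.45, "jl": 0.82}           # hidden-to-output weights
delta = {"k": 0.2, "l": -0.7}          # error gradients at output nodes k and l

# Weight update: w_jk <- w_jk + eta * y_j * delta_k
for out in ("k", "l"):
    w["j" + out] += eta * y_j * delta[out]

# Error gradient for hidden node j (the slide uses the updated weights here)
delta_j = y_j * (1 - y_j) * (w["jk"] * delta["k"] + w["jl"] * delta["l"])
print(round(w["jk"], 3), round(w["jl"], 3), round(delta_j, 6))  # 0.486 0.694 -0.034974
```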



- Not likely to occur in the human brain
- Learning is too slow. With some simple problems it can take hundreds or even thousands of epochs to reach a satisfactorily low level of error. [Q] Why?
- Weights are changed too easily. Can we use a sort of the idea of Simulated Annealing?
- [Q] How to improve?
Backpropagation with Delta Rule

- Generalized delta rule: inclusion of momentum, the extent to which a weight was changed on the previous iteration:

  Δwjk(t) = η yj δk + α Δwjk(t-1),
  wjk(t) = wjk(t-1) + Δwjk(t),

  where 0 <= α <= 1, usually very high such as 0.95

- Simple backpropagation: wjk(t) - wjk(t-1) = η yj δk
- Hyperbolic tangent instead of the sigmoid activation function:

  tanh(x) = 2a / (1 + e^(-bx)) - a,

  where a and b are constants, such as a = 1.7 and b = 0.7
- ...
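A sketch of the momentum update; only α = 0.95 and the form of the rule come from the slide, the weight, activation, gradient, and previous-change values are hypothetical:

```python
# Generalized delta rule: the current change includes a fraction (alpha)
# of the previous change, which smooths the trajectory of gradient descent.
eta, alpha = 0.2, 0.95
y_j, delta_k = 0.9, 0.2      # hypothetical activation and error gradient
w_jk, prev_dw = 0.45, 0.01   # hypothetical weight and previous weight change

dw = eta * y_j * delta_k + alpha * prev_dw   # momentum term added
w_jk += dw
print(round(dw, 4), round(w_jk, 4))  # 0.0455 0.4955
```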
Backpropagation

- Advantages:
  - It works!
  - Relatively fast
- Downsides:
  - Requires a training set
  - Can be slow
  - Probably not biologically realistic
- Alternatives to Backpropagation:
  - Hebbian learning
    - Not successful in feed-forward nets
  - Reinforcement learning
    - Only limited success
  - Artificial evolution
    - More general, but can be even slower than backpropagation
Example: Voice Recognition

- Task: Learn to discriminate between two different voices saying "Hello"
- [Q] How to implement?
- Data
  - Sources: Steve Simpson, David Raubenheimer
  - Format: frequency distribution (60 bins)
- Network architecture
  - Feed-forward network
    - 60 inputs (one for each frequency bin)
    - 6 hidden nodes
    - 2 outputs (0-1 for "Steve", 1-0 for "David")
  - [Q] What is the total number of nodes?
- Presenting the data (untrained network)
    Steve: 0.43  0.26
    David: 0.73  0.55
- Calculate error
    Steve: |0.43 - 0| = 0.43,  |0.26 - 1| = 0.74
    David: |0.73 - 1| = 0.27,  |0.55 - 0| = 0.55
- Backpropagate error and adjust weights
    Steve: 0.43 + 0.74 -> overall error 1.17
    David: 0.27 + 0.55 -> overall error 0.82
- Repeat the process (sweep) for all training pairs:
  - Present data
  - Calculate error
  - Backpropagate error
  - Adjust weights
- Repeat the process multiple times
- Presenting the data (trained network)
    Steve: 0.01  0.99
    David: 0.99  0.01

Results – Voice Recognition

- Performance of trained network
  - Discrimination accuracy between known "Hello"s: 100%
  - Discrimination accuracy between new "Hello"s: 100%

Results – Voice Recognition (contd.)

- The network has learnt to generalise from the original data.
- Networks with different weight settings can have the same functionality.
- Trained networks 'concentrate' on lower frequencies.
- The network is robust against non-functioning nodes.
Applications of Feed-forward Nets

- Pattern recognition
  - Character recognition
  - Face recognition
- Sonar mine/rock recognition (Gorman & Sejnowski, 1988)
- Navigation of a car (Pomerleau, 1989)
- Stock-market prediction
- Pronunciation – NETtalk (Sejnowski & Rosenberg, 1987)
5. Recurrent Networks

- Feed-forward networks do not have memory. [Q] What does it mean?
  - Acyclic
  - Once a feed-forward network is trained, its state is fixed and does not alter as new input data is presented to it.
- Recurrent networks, also called feedback networks, can have arbitrary connections between nodes in any layer, even backward from output nodes to input nodes. The internal state can alter as sets of input data are presented to it -> a memory. [Q] What does it mean?
  - Also called memory units






- Biological nervous systems show high levels of recurrency (but feed-forward structures exist too).
- Recurrent networks can be used to solve problems where the solution depends on previous inputs as well as current inputs (e.g. predicting stock market movements).
- Inputs are fed through the network, including feeding data back from outputs to inputs, and this process repeats until the values of the outputs do not change – a state of equilibrium or stability.
- The stable values of the network are called fundamental memories.
- But it is not always the case that a recurrent network reaches a stable state.
- Recurrent networks are also called attractor networks.
Hopfield Networks

- A Hopfield Network is a recurrent network.
- It uses a sign activation function:

  Sign(X) = +1 if X > 0, -1 if X < 0

- If a neuron receives a 0 as an input it does not change state – in other words, it continues to output the same previous value.
- Weights are usually represented as matrices:

  W = Σ (i = 1..N) Xi Xi^t - N I, where
    Xi is an input vector of m input values,
    I is the m x m identity matrix, and
    N is the number of states (Xi) that are to be learned.

- Three states to learn: [Q] How many input values?

  X1 = ( 1  1  1  1  1)^t
  X2 = (-1 -1 -1 -1 -1)^t
  X3 = ( 1 -1  1  1 -1)^t

- Then, the learning step is

  W = Σ (i = 1..3) Xi Xi^t - 3I =

      ( 0  1  3  3  1
        1  0  1  1  3
        3  1  0  3  1
        3  1  3  0  1
        1  3  1  1  0 )

- The three states will be the stable states for the network.
0
1

Output vector

Yi  Sign(WX i   ),
WX 1  3

where  contains the thresholdes
3
1
for each of the five inputs
0
t
X 1  1 1 1 1 1
1

t
WX 2  3
X 2   1 1 1 1 1

t
3
X 3  1 1 1 1 1
1
0 1 3 3 1 
0
1 0 1 1 3 
1

3


t
WX 3  3
W
X i X i  3  I  3 1 0 3 1 



i 1
3
3 1 3 0 1 
1
1 3 1 1 0 

Neural Network
1
0
1
1
3
1
0
1
1
3
1
0
1
1
3
3 3 1  1 8 
1
1
1 1 3  1 6 

0 3 1  1  8   1  Y1  X 1
   

3 0 1  1 8 
1
1
1 1 0  1 6 
3 3 1   1  8 
1 1 3   1  6 
0 3 1   1   8   ?  Y2  ?
   
3 0 1   1  8 
1 1 0   1  6 
3 3 1  1   4 
1 1 3   1 0 
0 3 1  1    4   ?  Y3  ?
   
3 0 1  1   4 
1 1 0   1 0 
 1
44
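A runnable sketch of the storage and recall steps. The three stored states are a reconstruction of the slide's example (minus signs were lost in the transcript), chosen to be consistent with the printed W and W·Xi values:

```python
# Hopfield network: storage (W = sum Xi Xi^t - N*I) and one recall step.
X = [[1, 1, 1, 1, 1],
     [-1, -1, -1, -1, -1],
     [1, -1, 1, 1, -1]]
n, N = len(X[0]), len(X)

W = [[sum(x[a] * x[b] for x in X) - (N if a == b else 0)
      for b in range(n)] for a in range(n)]

def recall(W, x):
    """One synchronous update; a zero activation keeps the previous value."""
    a = [sum(wab * xb for wab, xb in zip(row, x)) for row in W]
    return [1 if ai > 0 else -1 if ai < 0 else xi for ai, xi in zip(a, x)]

print(W[0])             # [0, 1, 3, 3, 1]
print(recall(W, X[0]))  # [1, 1, 1, 1, 1] -> X1 is a stable state
```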

- Output vector: Yi = Sign(W Xi - θ)

- Let X4 = (1 1 -1 1 1)^t. Then Y4?

  W X4 = (2 4 8 2 4)^t  ->  Sign(W X4) = (1 1 1 1 1)^t, so Y4 = X1

- Let X5 = (-1 1 -1 1 1)^t. Then

  W X5 = (? ? ? ? ?)^t  ->  Y5 = ?

  Then Y5 = (1 1 1 -1 1)^t

- Applied again, then Y5 -> X1. [Q] What does this mean?

- Three steps:
  - Training the weights with the attractor states as inputs – a storage or memorization stage
  - Testing
  - Using the network as a memory to retrieve data from its memory
- The network is trained to represent a set of attractors, or stable states. Any input usually will be mapped to an output state that is the attractor closest to the input.
  - The measure of distance is the Hamming distance, which counts the number of elements of the vectors that differ.
- Hence, the Hopfield network is a memory that usually maps an input vector to the memorized vector whose Hamming distance from the input vector is least.
- In fact, it does not always converge to the state closest to the original input.

- [Q] Applications of Hopfield networks?
  - Pattern recognition
- [Q] How?
- Demo
  - Pattern recognition: http://www.cbu.edu/~pong/ai/hopfield/hopfieldapplet.html
- [Q] How to implement the above demo?
- [Q] How is pattern recognition using Hopfield networks different from the one that uses the backpropagation network in the seminar?

- A Hopfield network is autoassociative – it can only associate an item with itself or a similar one.
- However, the human brain is fully associative, or heteroassociative, which means one item is able to cause the brain to recall an entirely different item.
  - E.g., fall makes me think of autumn colors.
- [Q] How to improve Hopfield networks?
Bidirectional Associative Memories

- A BAM (Bidirectional Associative Memory) is a heteroassociative memory:
  - Like the brain, it can learn to associate one item from one set with another completely unrelated item in another set.
  - Similar in structure to the Hopfield network
- The network consists of two fully connected layers of nodes.
  - Every node in one layer is connected to every node in the other layer, not to any node in the same layer.
  - In a Hopfield network, there is only one layer and all nodes are interconnected.
- The BAM is guaranteed to produce a stable output for any given inputs and for any training data.



- The BAM uses a sign activation function.
- Two sets of data are to be learned, so that when an item from set X is presented to the network, it will recall a corresponding item from set Y.
- The weight matrix is

  W = Σ (i = 1..n) Xi Yi^t
  X1 = [ 1  1]^t       Y1 = [ 1 -1  1]^t
  X2 = [-1 -1]^t       Y2 = [-1  1 -1]^t

  W = X1 Y1^t + X2 Y2^t = [ 2 -2  2
                            2 -2  2 ]

  Sign(W^t X1) = Y1      [Q] Can you prove it?
  Sign(W Y1) = X1
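A sketch of BAM storage and bidirectional recall; the exact pattern values are an assumed reconstruction of the slide's example (signs were lost in the transcript):

```python
# BAM: W = sum_i Xi Yi^t; recall runs in both directions through W.
X = [[1, 1], [-1, -1]]
Y = [[1, -1, 1], [-1, 1, -1]]

W = [[sum(x[a] * y[b] for x, y in zip(X, Y)) for b in range(3)]
     for a in range(2)]                     # a 2 x 3 weight matrix

def sign(v):
    return [1 if vi >= 0 else -1 for vi in v]

# Presenting X1 on the X layer recalls Y1 on the Y layer, and vice versa.
recalled_y = sign([sum(W[a][b] * X[0][a] for a in range(2)) for b in range(3)])
recalled_x = sign([sum(W[a][b] * Y[0][b] for b in range(3)) for a in range(2)])
print(W)                       # [[2, -2, 2], [2, -2, 2]]
print(recalled_y, recalled_x)  # [1, -1, 1] [1, 1]
```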
- [Q] Applications of BAM?
  - Pattern recognition
- Demo
  - Pattern recognition: http://www.cbu.edu/~pong/ai/bam/bamapplet.html
6. Kohonen Maps

- Also called SOM (Self-Organizing Feature Map)
- An unsupervised learning system: there is no labelled training data.
- Finds the natural structure inherent in the input data
- The objective of a Kohonen network is to map input vectors (patterns) of arbitrary dimension N onto a discrete map with 1, 2 or 3 dimensions.
- Patterns close to one another in the input space should be close to one another in the map: they should be topologically ordered.


- Two layers of nodes: an input layer and a cluster (output) layer.
- Uses competitive learning (winner-take-all):
  - The number of input nodes is equal to the size of the input vectors.
  - The number of cluster nodes is equal to the number of clusters to find.
  - Every input node is connected to every node in the cluster layer.
  - Every input is compared with the weight vectors of each node in the cluster layer.
  - The node in the cluster layer that most closely matches the input fires. This node is called the winner. This is the clustering of the input.
    - Euclidean distance is used.
  - The winning node has its weight vector modified to be closer to the input vector.

- Learning process:
  - initialize the weights for each cluster unit
  - loop until weight changes are negligible
    - for each input pattern
      - present the input pattern
      - find the winning cluster unit (i.e., the most similar one to the input pattern)
      - find all units in the neighborhood of the winner
      - update the weight vectors for all those units
    - reduce the size of neighborhoods if required

- The cluster node j for which dj is the smallest is the winner:

  dj = Σ (i = 1..n) (wij - xi)^2, where x is the input vector and wj is the weight vector of cluster node j

- Update of the weights of the winner:

  wij <- wij + α (xi - wij)

  => The weights become similar to the input, i.e., the winner becomes similar to the input.

- The learning rate α decreases over time.
- In fact, a neighborhood of neurons around the winner is usually updated together.
- The radius defining the neighborhood decreases over time.
- The training phase terminates when the modification of weights becomes very small.
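The winner-take-all loop above can be sketched as follows; this simplification drops the neighborhood and the decaying learning rate, and the data and parameter values are hypothetical:

```python
# Minimal Kohonen-style competitive learning: find the winner by Euclidean
# distance, then move its weight vector toward the input.
import random

def train_som(patterns, n_clusters, alpha=0.5, epochs=20, dim=2):
    random.seed(1)  # fixed seed so the sketch is repeatable
    w = [[random.random() for _ in range(dim)] for _ in range(n_clusters)]
    for _ in range(epochs):
        for x in patterns:
            j = min(range(n_clusters),
                    key=lambda j: sum((wj - xi) ** 2 for wj, xi in zip(w[j], x)))
            w[j] = [wj + alpha * (xi - wj) for wj, xi in zip(w[j], x)]
    return w

# Two well-separated groups of 2-D points: each group claims one cluster node.
data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
print(train_som(data, n_clusters=2))
```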
- It has been shown that while self-organizing maps with a small number of nodes behave in a way that is similar to K-means, larger self-organizing maps rearrange data in a way that is fundamentally topological in character.
- High-dimensional data can be mapped down to 2 or 3 dimensions.
- Applications?
  - Clustering
  - Dimension reduction; visualization of high-dimensional data
- Demo
  - http://www.superstable.net/sketches/som/
7. Hebbian Learning

- Hebb's law: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
- Hence, if two neurons that are connected together fire at the same time, the weight of the connection between them is strengthened.
- Conversely, if the neurons fire at different times, the weight of the connection between them is decreased.

  (Figure: node i --wij--> node j; input xi, output yj)


- The activity product rule is used to modify the weight of a connection between two nodes i and j that fire at the same time:

  wij <- wij + α yj xi

  where α is the learning rate, xi is the input to node j from node i, and yj is the output of node j.
- Hebbian networks usually also use a forgetting factor φ, which decreases the weight of the connection between the two nodes if they fire at different times:

  wij <- wij + α yj xi - φ yj wij

  (Figure: node i --wij--> node j; input xi, output yj)
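The update rule above can be sketched directly; α = 0.1 and φ = 0.01 are assumed values chosen only for illustration:

```python
# Activity product rule with a forgetting factor.
alpha, phi = 0.1, 0.01

def hebbian_update(w_ij, x_i, y_j):
    # strengthen the connection when i and j are active together;
    # the forgetting term decays the weight toward zero
    return w_ij + alpha * y_j * x_i - phi * y_j * w_ij

w = hebbian_update(0.5, x_i=1.0, y_j=1.0)
print(round(w, 3))  # 0.5 + 0.1 - 0.005 = 0.595
```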
More interesting demo

- BrainyAliens
8. Fuzzy Neural Networks

- Weights are fuzzy sets
- Takagi-Sugeno fuzzy rules
9. Evolving Neural Networks

- Neural networks can be susceptible to local minima in the error surface.
- Evolutionary methods (genetic algorithms) can be used to determine the starting weights for a neural network, thus avoiding these kinds of problems.
Other Learning Methods

- Clustering
  - The K-Means
  - The Fuzzy C-Means
- Classification
  - Naïve Bayes classifier
  - Support vector machine
  - Decision tree
- Reinforcement learning
- ...

I will be back!