Neural Networks
WHY ARTIFICIAL NEURAL NETWORKS?
• Characteristics of the human brain that are not present
in von Neumann or modern parallel computers include:
• massive parallelism,
• distributed representation and computation,
• learning ability,
• generalization ability,
• adaptivity,
• inherent contextual information processing,
• fault tolerance, and
• low energy consumption.
• It is hoped that devices based on biological neural
networks will possess some of these desirable
characteristics.
ANNs
• Inspired by biological neural networks, ANNs
are massively parallel computing systems
consisting of an extremely large number of
simple processors with many interconnections.
• ANN models attempt to use some
"organizational" principles believed to be used in
the human brain.
Brief historical review
• ANN research has experienced three periods of
extensive activity:
• The first peak in the 1940s was due to McCulloch and Pitts'
pioneering model of the neuron.
• The second occurred in the 1960s with Rosenblatt's
perceptron convergence theorem and Minsky and Papert's
work showing the limitations of a simple perceptron. Minsky
and Papert's results dampened the enthusiasm of most
researchers, a lull that lasted almost 20 years.
• Since the early 1980s, ANNs have received considerable
renewed interest. The major developments include
• Hopfield's energy approach in 1982 and
• the back-propagation learning algorithm for multilayer perceptrons
(multilayer feed-forward networks), first proposed by Werbos and then
popularized by Rumelhart et al. in 1986.
Biological neural networks
• A neuron (or nerve cell) is a special biological
cell that processes information. It is composed of
a cell body, or soma, and two types of out-reaching
tree-like branches: the axon and the dendrites.
Biological neural networks (cont.)
• A neuron receives signals (impulses) from other
neurons through its dendrites (receivers) and
transmits signals generated by its cell body along
the axon (transmitter), which eventually
branches into strands and substrands.
• At the terminals of these strands are the
synapses.
• A synapse is an elementary structural and
functional unit between two neurons (an axon
strand of one neuron and a dendrite of another).
Biological neural networks (cont.)
• The human brain contains about 10¹¹ neurons,
which is approximately the number of stars in
the Milky Way.
• Neurons are massively connected, much more
complex and dense than telephone networks.
• Each neuron is connected to 10³ to 10⁴ other
neurons.
• In total, the human brain contains approximately
10¹⁴ to 10¹⁵ interconnections.
Biological neural networks (cont.)
• Complex perceptual decisions such as face
recognition are typically made by humans within
a few hundred milliseconds.
• These decisions are made by a network of
neurons whose operational speed is only a few
milliseconds. This implies that
• the computations cannot take more than about 100
serial stages, and
• the brain runs parallel programs that are about 100
steps long for such perceptual tasks. This is known as
the hundred-step rule.
Computational models of neurons
• This mathematical neuron computes a weighted
sum of its n input signals x_j, j = 1, 2, ..., n.
• It generates an output of 1 if this sum exceeds a certain
threshold U; otherwise, an output of 0 results.
• Mathematically,
y = θ( ∑_{j=1}^{n} w_j x_j − U )
where θ(·) is the unit step function and w_j is the synapse weight
associated with the jth input.
• For simplicity of notation, we often consider the
threshold U as another weight w_0 = −U attached
to the neuron with a constant input x_0 = 1.
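A minimal sketch of this unit in Python (the function name and the example weights and threshold are illustrative assumptions):

def mcculloch_pitts(x, w, u):
    # Fire (output 1) only if the weighted input sum exceeds the threshold u.
    v = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if v > u else 0

# Example: a 2-input neuron computing logical AND with w = (1, 1), u = 1.5
print(mcculloch_pitts([1, 1], [1, 1], 1.5))  # -> 1
print(mcculloch_pitts([1, 0], [1, 1], 1.5))  # -> 0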
Activation Functions
The Sigmoid
• The standard sigmoid function is the logistic
function, defined by
g(v) = 1 / (1 + e^(−βv))
where β is the slope parameter.
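A sketch of this function in Python (the parameter name beta is an assumption for the slope symbol):

import math

def sigmoid(v, beta=1.0):
    # Logistic function g(v) = 1 / (1 + exp(-beta * v)).
    return 1.0 / (1.0 + math.exp(-beta * v))

# A larger beta makes the transition steeper, approaching the unit step.
print(sigmoid(0.5, beta=1.0))   # ~0.62
print(sigmoid(0.5, beta=10.0))  # ~0.99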
Network architectures
• ANNs can be viewed as weighted directed
graphs in which artificial neurons are nodes and
directed edges (with weights) are connections
between neuron outputs and neuron inputs.
• feed-forward networks, in which graphs have no
loops, and
• recurrent (or feedback) networks, in which loops
occur because of feedback connections.
Network architectures
Different connectivities yield different network behaviors.
Network architectures
• Feed-forward networks are
• static, that is, they produce only one set of output values
rather than a sequence of values from a given input.
• memory-less in the sense that their response to an input is
independent of the previous network state.
• Recurrent, or feedback, networks are
• dynamic systems.
• When a new input pattern is presented, the neuron outputs are
computed. Because of the feedback paths, the inputs to each
neuron are then modified, which leads the network to enter a
new state.
• Different network architectures require appropriate learning
algorithms.
Learning
• A learning process in the ANN context can be
viewed as the problem of updating network
architecture and connection weights so that a
network can efficiently perform a specific task.
• The network usually must learn the connection
weights from available training patterns.
• Performance is improved over time by iteratively
updating the weights in the network.
Learning
• ANNs' ability to automatically learn from
examples makes them attractive and exciting.
• ANNs appear to learn underlying rules (like
input-output relationships) from the given
collection of representative examples.
• This is one of the major advantages of neural
networks over traditional expert systems.
Learning algorithm
• To understand or design a learning process, you must have:
1. A learning paradigm: a model of the environment
in which a neural network operates, i.e., you must
know what information is available to the network.
2. Learning rules: you must understand how network
weights are updated, i.e., which learning rules
govern the updating process.
• A learning algorithm refers to a procedure in
which learning rules are used for adjusting the
weights.
Learning paradigms
• Supervised learning: The network is provided with a correct
answer (output) for every input pattern - learning with a
“teacher”.
• Weights are determined to allow the network to produce answers as close
as possible to the known correct answers.
• Reinforcement learning is a variant of supervised learning in which the
network is provided with only a critique on the correctness of network
outputs, not the correct answers themselves.
• Unsupervised learning: The network explores the underlying
structure in the data, or correlations between patterns in the data,
and organizes patterns into categories from these correlations -
learning without a teacher.
• Hybrid learning: Part of the weights are usually determined
through supervised learning, while the others are obtained
through unsupervised learning - combines supervised and
unsupervised learning.
Learning theory
• Learning theory must address three fundamental and practical
issues associated with learning from samples: capacity, sample
complexity, and computational complexity.
• Capacity: how many patterns can be stored, and what functions
and decision boundaries a network can form.
• Sample complexity: determines the number of training patterns
needed to train the network to guarantee a valid generalization.
• Too few patterns may cause “over-fitting” (wherein the network performs
well on the training data set, but poorly on independent test patterns drawn
from the same distribution as the training patterns).
• Computational complexity: refers to the time required for a
learning algorithm to estimate a solution from training patterns.
• Many existing learning algorithms have high computational complexity.
Learning rules
• Error correction, Boltzmann, Hebbian, and Competitive
learning.
• ERROR-CORRECTION RULES: During the
learning process, the actual output y generated by the
network may not equal the desired output d.
• The basic principle of error-correction learning rules is to use
the error signal (d-y) to modify the connection weights to
gradually reduce this error.
• The perceptron learning rule is based on this error-correction principle.
• A perceptron consists of a single neuron with adjustable
weights, wj, j = 1,2, . . . , n, and threshold U (threshold
function).
ERROR-CORRECTION RULES
• Given an input vector x = (x_1, x_2, ..., x_n)^t, the net
input to the neuron is
v = ∑_{j=1}^{n} w_j x_j − U
• The output y of the perceptron
is +1 if v > 0, and 0 otherwise.
• In a two-class classification problem, the
perceptron assigns an input pattern to one class if
y = 1, and to the other class if y = 0.
• The linear equation ∑_{j=1}^{n} w_j x_j − U = 0 defines the decision
boundary (a hyperplane) that halves the space.
Perceptron learning algorithm
1) Randomly initialize the weights and threshold
w_1, w_2, ..., w_n.
2) Present an input vector x = (x_1, x_2, ..., x_n)^t and
evaluate the output of the neuron.
3) Update the weights according to
w_j(t + 1) = w_j(t) + η (d − y) x_j
where d is the desired output, t is the iteration
number, and η is the gain (step size), 0.0 < η < 1.0.
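A runnable sketch of this procedure (the data, learning rate, and iteration cap are illustrative; the threshold is folded in as weight w_0 = −U with constant input x_0 = 1, as introduced earlier):

import random

def train_perceptron(patterns, eta=0.1, max_iters=100):
    n = len(patterns[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]  # w[0] plays the role of -U
    for _ in range(max_iters):
        errors = 0
        for x, d in patterns:
            xs = [1] + list(x)                      # prepend the constant input x0 = 1
            v = sum(wj * xj for wj, xj in zip(w, xs))
            y = 1 if v > 0 else 0
            if y != d:                              # learning occurs only on error
                w = [wj + eta * (d - y) * xj for wj, xj in zip(w, xs)]
                errors += 1
        if errors == 0:                             # converged on the training set
            break
    return w

# Linearly separable example: logical OR
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(data))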
Perceptron learning algorithm
• Note that learning occurs only when the perceptron makes an
error.
• The perceptron convergence theorem: Rosenblatt proved that
when training patterns are drawn from two linearly separable
classes, the perceptron learning procedure converges after a finite
number of iterations.
• In practice, you do not know whether the patterns are linearly
separable.
• Many variations of this learning algorithm have been proposed in
the literature.
• Other activation functions that lead to different learning
characteristics can also be used.
• The back-propagation learning algorithm is based on this error-correction principle.
Perceptrons and Boolean Functions
If inputs are all 0's and 1's and outputs are all 0's
and 1's...
• A perceptron can learn the AND function x1 ∧ x2.
• A perceptron can learn the OR function x1 ∨ x2.
Perceptrons and Boolean Functions
• What about the exclusive-or function?
f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2)
XOR problem
Desired: make an ANN that will produce Y = X1 xor
X2 on inputs X1 and X2.
Problem: there is no single line that can "cut" the (X1, X2)
space into two proper regions. Therefore, we cannot use a
single-layer neural net.
Solution: use a multilayer network, as in the sketch below.
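For instance, a two-layer net of threshold units with hand-picked weights (an illustrative assumption, not learned weights) realizes XOR: one hidden unit computes OR, the other AND, and the output fires when OR holds but AND does not.

def step(v):
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)          # hidden unit 1: x1 OR x2
    h2 = step(x1 + x2 - 1.5)          # hidden unit 2: x1 AND x2
    return step(h1 - 2 * h2 - 0.5)    # output: h1 AND NOT h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))    # prints the XOR truth table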
HEBBIAN RULE
• The oldest learning rule is Hebb’s postulate of learning.
Hebb based it on the following observation from
neurobiological experiments:
• If neurons on both sides of a synapse are activated
synchronously and repeatedly, the synapse’s strength is
selectively increased.
• Mathematically, the Hebbian rule can be described as
w_ij(t + 1) = w_ij(t) + η y_j(t) x_i(t)
where x_i and y_j are the output values of neurons i and j,
respectively, which are connected by the synapse w_ij,
and η is the learning rate. Note that x_i is the input to the
synapse.
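A sketch of this update in Python (names are illustrative; note that the update uses only the activities of the two connected neurons):

def hebb_update(w, x, y, eta=0.01):
    # w[i][j] is the synapse from neuron i (output x[i]) to neuron j (output y[j]).
    for i in range(len(x)):
        for j in range(len(y)):
            w[i][j] += eta * x[i] * y[j]   # strengthen co-active connections
    return w

w = [[0.0, 0.0], [0.0, 0.0]]
print(hebb_update(w, x=[1.0, 0.5], y=[0.8, 0.2]))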
HEBBIAN RULE
• An important property of this rule is that
• learning is done locally, i.e., the change in synapse
weight depends only on the activities of the two
neurons connected by it.
• This significantly simplifies the complexity of the
learning circuit in a VLSI implementation.
HEBBIAN RULE
• A single neuron trained using the Hebbian rule exhibits
orientation selectivity.
• The points depicted are drawn from a two-dimensional
Gaussian distribution and used for training a neuron.
• The weight vector of the neuron is initialized to w0.
• As the learning proceeds, the weight vector moves
progressively closer to the direction w of maximal
variance in the data.
• w is the eigenvector of the covariance matrix of the data
corresponding to the largest eigenvalue.
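The demonstration can be sketched as follows. Since plain Hebbian growth is unbounded, the weight vector is renormalized after each step (an Oja-style stabilization that is an assumption here, not stated on the slide); the covariance matrix is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[3.0, 1.5], [1.5, 1.0]])            # 2-D Gaussian data (illustrative)
data = rng.multivariate_normal([0, 0], cov, size=2000)

w = np.array([1.0, -1.0]); w /= np.linalg.norm(w)   # w0: an initial direction
eta = 0.01
for x in data:
    y = w @ x                                       # linear neuron output
    w += eta * y * x                                # Hebbian update
    w /= np.linalg.norm(w)                          # keep |w| = 1 (stabilization)

# w should align with the eigenvector of cov having the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
print(w, eigvecs[:, -1])                            # same direction, up to sign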
BOLTZMANN LEARNING
• The Boltzmann machine (named by its inventors in honour of the
19th-century physicist Ludwig Boltzmann)
• Boltzmann machines are symmetric recurrent networks
consisting of binary units (+1 for "on" and -1 for "off").
• the weight on the connection from unit i to unit j is equal to
the weight on the connection from unit j to unit i.
• A subset of the neurons, called visible, interact with the
environment; the rest, called hidden, do not.
• Each neuron is a stochastic unit that generates an output
(or state) according to the Boltzmann distribution of
statistical mechanics.
• Boltzmann machines operate in two modes:
• Clamped: visible neurons are clamped onto specific
states determined by the environment; and
• Free-running: both visible and hidden neurons are
allowed to operate freely. The hidden neurons always
operate freely.
(K denotes the number of visible neurons and L the number of hidden neurons.)
BOLTZMANN LEARNING
• Boltzmann learning is a stochastic learning rule derived from
information-theoretic and thermodynamic principles.
• The objective of Boltzmann learning is to adjust the connection
weights so that the states of visible units satisfy a particular
desired probability distribution.
• According to the Boltzmann learning rule, the change in the
connection weight w_ij is given by
Δw_ij = η (ρ̄_ij − ρ_ij)
where η is the learning rate, and ρ̄_ij and ρ_ij are the correlations
between the states of units i and j when the network operates in
the clamped mode and free-running mode, respectively.
Summary of the Boltzmann Machine Learning
Procedure
1. Initialization: set weights to random numbers in
[–1,1]
2. Clamping Phase: Present the net with the
mapping it is supposed to learn by clamping
input and output units to patterns. For each
pattern, perform simulated annealing on the
hidden units at a sequence T_0, T_1, ..., T_final of
temperatures. At the final temperature, collect
statistics to estimate the clamped correlations ρ̄_ij.
Summary of the Boltzmann Machine Learning
Procedure
3. Free-Running Phase: Repeat the calculations
performed in step 2, but this time clamp only the
input units. Again, at the final temperature,
estimate the free-running correlations ρ_ij.
4. Updating of Weights: update the weights using the
learning rule Δw_ij = η (ρ̄_ij − ρ_ij), where η is a
learning rate parameter.
Summary of the Boltzmann Machine Learning
Procedure
5. Iterate until Convergence: Iterate steps 2 to 4
until the learning procedure converges with no
more changes taking place in the synaptic
weights wji for all j, i.
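The loop below is a much-simplified runnable sketch of this procedure. The network size, training patterns, temperature schedule, Gibbs-sweep length, and the crude one-sample correlation estimates at the final temperature are all illustrative assumptions, not the slide's specification.

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 4, 2
n = n_vis + n_hid
W = rng.uniform(-0.5, 0.5, (n, n))
W = (W + W.T) / 2.0                                # symmetric connections
np.fill_diagonal(W, 0.0)                           # no self-feedback

def gibbs(s, n_clamped, T, steps=60):
    # Update unclamped units one at a time with Boltzmann probabilities at temperature T.
    for _ in range(steps):
        i = int(rng.integers(n_clamped, n))        # units 0..n_clamped-1 stay clamped
        h = W[i] @ s                               # net input to unit i
        p_on = 1.0 / (1.0 + np.exp(-2.0 * h / T))
        s[i] = 1.0 if rng.random() < p_on else -1.0
    return s

patterns = np.array([[1, 1, -1, -1], [-1, -1, 1, 1]], float)
eta, schedule = 0.05, [4.0, 2.0, 1.0]              # annealing temperatures

for _ in range(200):
    rho_clamped = np.zeros((n, n))
    rho_free = np.zeros((n, n))
    for v in patterns:
        s = np.concatenate([v, rng.choice([-1.0, 1.0], n_hid)])
        for T in schedule:                         # clamped phase: visible units fixed
            s = gibbs(s, n_vis, T)
        rho_clamped += np.outer(s, s)              # one-sample correlation estimate
        s = rng.choice([-1.0, 1.0], n)
        for T in schedule:                         # free-running phase: nothing clamped
            s = gibbs(s, 0, T)
        rho_free += np.outer(s, s)
    dW = eta * (rho_clamped - rho_free) / len(patterns)
    np.fill_diagonal(dW, 0.0)
    W += dW                                        # the Boltzmann learning rule
print(np.round(W, 2))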
Alternative Boltzmann Architecture
• Alternatively, the visible units may be viewed as divided into
input and output units.
• In this case the Boltzmann machine performs association under
the supervision of a teacher, with the input units receiving
information from the environment, and the output units reporting
the outcome for that input pattern.
Boltzmann vs Hopfield
Similarities:
1. Processing units have binary states (±1)
2. Connections between units are symmetric
3. Units are picked at random and one at a time for
updating
4. Units have no self-feedback.
Differences:
1. Boltzmann machine permits the use of hidden neurons.
2. Boltzmann machine uses stochastic neurons with a
probabilistic firing mechanism, whereas the standard
Hopfield net uses neurons based on the McCulloch-Pitts
model with a deterministic firing mechanism.
3. Boltzmann machine may also be trained by a
probabilistic form of supervision.
COMPETITIVE LEARNING RULES
• Competitive-learning output units compete
among themselves for activation. As a result,
only one output unit is active at any given time.
This phenomenon is known as winner-take-all.
• Competitive learning has been found to exist in
biological neural networks.
• Competitive learning often clusters or
categorizes the input data. Similar patterns are
grouped by the network and represented by a
single unit. This grouping is done automatically
based on data correlations.
COMPETITIVE LEARNING RULES
• The simplest competitive learning network
consists of a single layer of output units.
• Each output unit i in the network connects to all
the input units (the x_j's) via weights w_ij, j = 1, 2, ..., n.
• Each output unit also connects to all other output
units via inhibitory weights, but has a self-feedback
connection with an excitatory weight.
COMPETITIVE LEARNING RULES
• A simple competitive learning rule can be stated as
Δw_ij = η (x_j − w_ij) if unit i wins the competition,
Δw_ij = 0 otherwise.
• Note that only the weights of the winner unit get
updated.
• The effect of this learning rule is to move the
stored pattern in the winner unit (weights) a little
bit closer to the input pattern.
• Assume that all input vectors have been
normalized to have unit length.
• The weight vectors of the three units are
randomly initialized. Their initial and final
positions on the sphere after competitive learning
are marked as Xs.
• Each of the three natural groups (clusters) of patterns has been discovered by
an output unit whose weight vector points to the center of gravity of the
discovered group (see the sketch below).
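A runnable sketch of winner-take-all clustering (cluster centres, unit count, and learning rate are illustrative; the winner is chosen by smallest Euclidean distance, which for unit-length vectors matches the largest inner product):

import numpy as np

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.1, (50, 2)) for c in ([0, 0], [2, 2], [0, 2])])
rng.shuffle(data)

W = rng.uniform(0, 2, (3, 2))                          # 3 output units, random init
eta = 0.1
for x in data:
    winner = np.argmin(np.linalg.norm(W - x, axis=1))  # closest weight vector wins
    W[winner] += eta * (x - W[winner])                 # move only the winner toward x

print(W)   # each row should sit near one cluster centre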
COMPETITIVE LEARNING RULES
• You can see from the competitive learning rule that the
network will not stop learning (updating weights) unless
the learning rate η is 0.
• A particular input pattern can fire different output units
at different iterations during learning.
• The system is said to be stable if no pattern in the
training data changes its category after a finite number
of learning iterations.
• One way to achieve stability is to force the learning rate to
decrease gradually towards 0 as the learning process proceeds.
However, this artificial freezing of learning causes another
problem: loss of plasticity, the ability to adapt to new
data. This is known as Grossberg's stability-plasticity
dilemma in competitive learning.
COMPETITIVE LEARNING RULES
• The most well-known example of competitive learning
is vector quantization for data compression.
• It has been widely used in speech and image processing
for efficient storage, transmission, and modeling.
• Its goal is to represent a set or distribution of input
vectors with a relatively small number of prototype
vectors (weight vectors), or a codebook. Once a
codebook has been constructed and agreed upon by both
the transmitter and the receiver, you need only transmit
or store the index of the prototype corresponding to the
input vector.
• Given an input vector, its corresponding prototype can
be found by searching for the nearest prototype in the
codebook.
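A minimal sketch of the encode/decode step, assuming the codebook (e.g., the weight vectors found by competitive learning above) is already agreed upon by both ends:

import numpy as np

def vq_encode(x, codebook):
    # Transmit or store only the index of the nearest prototype.
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def vq_decode(index, codebook):
    # The receiver reconstructs the vector from the shared codebook.
    return codebook[index]

codebook = np.array([[0.0, 0.0], [2.0, 2.0], [0.0, 2.0]])  # illustrative prototypes
x = np.array([1.9, 2.2])
i = vq_encode(x, codebook)
print(i, vq_decode(i, codebook))    # store/transmit just i instead of x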
Well-known learning algorithms
SUMMARY
• Learning rules based on error-correction can be used for
training feed-forward networks.
• Hebbian learning rules have been used for all types of
network architectures.
• Each learning algorithm is designed for training a
specific architecture.
• When we discuss a learning algorithm, a particular
network architecture association is implied.
• Each algorithm can perform only a few tasks well.
• Other algorithms include Adaline, Madaline, linear
discriminant analysis, Sammon's projection, and
principal component analysis.
Multilayer Networks
The class of functions representable by perceptrons is
limited:
Out(x) = g(w · x) = g( ∑_j w_j x_j )
A multilayer network computes instead
Out(x) = g( ∑_j W_j g( ∑_k w_jk x_k ) )
This is a nonlinear function
of a linear combination
of nonlinear functions
of linear combinations of inputs.
A 1-HIDDEN LAYER NET
N_INPUTS = 2
N_HIDDEN = 3
Hidden units:
v_1 = g( ∑_{k=1}^{N_INS} w_1k x_k )
v_2 = g( ∑_{k=1}^{N_INS} w_2k x_k )
v_3 = g( ∑_{k=1}^{N_INS} w_3k x_k )
Output unit:
Out = g( ∑_{k=1}^{N_HID} W_k v_k )
(The w_jk are input-to-hidden weights and the W_k are hidden-to-output
weights. A sketch follows.)
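A sketch of this forward pass with illustrative random weights:

import numpy as np

def g(v):
    return 1.0 / (1.0 + np.exp(-v))                 # logistic activation

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, (3, 2))                      # hidden weights w[j, k]
W = rng.uniform(-1, 1, 3)                           # output weights W[j]

x = np.array([0.5, -1.0])
v = g(w @ x)                                        # v_j = g(sum_k w_jk x_k)
out = g(W @ v)                                      # Out = g(sum_j W_j v_j)
print(v, out)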
OTHER NEURAL NETS
Multilayer perceptron
• The most popular class of multilayer feed-forward
networks is multilayer perceptrons
• Each computational unit employs either the
thresholding function or the sigmoid function.
• Multilayer perceptrons can form arbitrarily complex
decision boundaries and represent any Boolean
function.
• The development of the back-propagation learning
algorithm for determining weights in a multilayer
perceptron has made these networks the most popular
among researchers and users of neural networks.
Multilayer perceptron
• We denote by w_ij^(l) the weight on the connection
from the ith unit in layer (l−1) to the jth unit in layer l.
• Let {(x^(1), d^(1)), (x^(2), d^(2)), ..., (x^(p), d^(p))} be a set of p
training patterns (input-output pairs),
where x^(i) ∈ R^n is the input vector in the n-dimensional
pattern space, and
d^(i) ∈ [0, 1]^m, an m-dimensional hypercube.
For classification purposes, m is the number of classes.
The squared-error cost function most frequently used in
the ANN literature is defined as
E = (1/2) ∑_{i=1}^{p} || d^(i) − y^(i) ||^2
where y^(i) is the output the network produces for input x^(i).
Back-propagation
The back-propagation algorithm is a gradient-descent
method to minimize the squared-error cost function E.
GRADIENT DESCENT
Suppose we have a scalar function
f(w) : R → R
We want to find a local minimum.
Assume our current weight is w.
GRADIENT DESCENT RULE:
w ← w − η ∂f(w)/∂w
η is called the LEARNING RATE. It is a small positive
number, e.g. η = 0.05.
Gradient Descent in “m” Dimensions
Given f(w) : R^m → R, the gradient
∇f(w) = [ ∂f(w)/∂w_1, ..., ∂f(w)/∂w_m ]^T
points in the direction of steepest ascent,
and |∇f(w)| is the gradient in that direction.
GRADIENT DESCENT RULE:
w ← w − η ∇f(w)
Equivalently, w_j ← w_j − η ∂f(w)/∂w_j
….where w_j is the jth weight.
"just like a linear feedback system"
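A runnable sketch of the rule on an illustrative quadratic whose gradient is known in closed form:

import numpy as np

c = np.array([1.0, -2.0, 3.0])                      # minimizer of the test function
f = lambda w: np.sum((w - c) ** 2)                  # f(w) = |w - c|^2
grad_f = lambda w: 2 * (w - c)                      # its gradient

w = np.zeros(3)
eta = 0.05
for _ in range(200):
    w = w - eta * grad_f(w)                         # the gradient descent rule

print(w, f(w))    # w approaches c, f(w) approaches 0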
A RULE KNOWN BY MANY NAMES
(the delta rule, the LMS rule, the Widrow-Hoff rule)
δ_i = y_i − w · x_i
w_j ← w_j + η δ_i x_ij
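A sketch of this update fitting a linear unit to illustrative synthetic data (names and data are assumptions):

import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ true_w + 0.01 * rng.normal(size=200)        # noisy targets

w = np.zeros(2)
eta = 0.05
for xi, yi in zip(X, y):
    delta = yi - w @ xi                             # delta_i = y_i - w.x_i
    w = w + eta * delta * xi                        # w_j <- w_j + eta * delta_i * x_ij

print(w)   # close to true_w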
Back-propagation algorithm
1. Initialize the weights to small random values.
2. Randomly choose an input pattern x^(μ).
3. Propagate the signal forward through the network.
4. Compute δ_i^L in the output layer (o_i = y_i^L):
δ_i^L = g′(h_i^L) [d_i^μ − y_i^L]
where h_i^l represents the net input to the ith unit in the lth layer, and g′ is the
derivative of the activation function g.
5. Compute the deltas for the preceding layers by propagating the errors
backwards:
δ_i^l = g′(h_i^l) ∑_j w_ij^(l+1) δ_j^(l+1)
for l = (L−1), ..., 1.
6. Update the weights using
Δw_ji^l = η δ_i^l y_j^(l−1)
7. Go to step 2 and repeat for the next pattern until the error in the output layer is
acceptably low, or a prespecified number of iterations is reached.
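The following is a runnable sketch of steps 1-7 for one hidden layer, using the logistic activation so that g′(h) = y(1 − y). The XOR data, layer sizes, learning rate, and iteration count are illustrative assumptions, and convergence can depend on the random seed.

import numpy as np

rng = np.random.default_rng(0)

def g(v):                                          # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # XOR patterns
D = np.array([[0], [1], [1], [0]], float)

W1 = rng.uniform(-1, 1, (2, 4)); b1 = np.zeros(4)  # step 1: small random weights
W2 = rng.uniform(-1, 1, (4, 1)); b2 = np.zeros(1)
eta = 0.5

for _ in range(20000):
    i = rng.integers(len(X))                       # step 2: pick a random pattern
    x, d = X[i], D[i]
    y1 = g(W1.T @ x + b1)                          # step 3: forward pass
    y2 = g(W2.T @ y1 + b2)
    d2 = y2 * (1 - y2) * (d - y2)                  # step 4: output delta, g'(h)=y(1-y)
    d1 = y1 * (1 - y1) * (W2 @ d2)                 # step 5: back-propagated deltas
    W2 += eta * np.outer(y1, d2); b2 += eta * d2   # step 6: weight updates
    W1 += eta * np.outer(x, d1);  b1 += eta * d1

print(np.round(g(g(X @ W1 + b1) @ W2 + b2), 2))    # should approach [0, 1, 1, 0]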
Backpropagation algorithm (instance-based)
1. Randomize the weights {w_s} to small random values (both positive and
negative) to ensure that the network is not saturated by large values of
weights.
2. Select an instance t, that is, the vector {x_k(t)}, k = 1, ..., N_inp (a pair of input
and output patterns), from the training set.
3. Apply the network input vector to the network input.
4. Calculate the network output vector {z_k(t)}, k = 1, ..., N_out.
5. Calculate the error for the outputs k = 1, ..., N_out, the difference
between the desired output and the network output:
E(t) = (1/2) ∑_{k=1}^{N_out} (d_k(t) − z_k(t))^2
(for simplicity we will denote it as simply E).
6. Calculate the necessary updates for the weights Δw_s in a way that minimizes
this error (discussed below).
7. Adjust the weights of the network by Δw_s.
8. Repeat steps 2-6 for each instance (pair of input-output vectors) in the
training set until the error for the entire system (the error E defined above or
the error on a cross-validation set) is acceptably low, or the pre-defined
number of iterations is reached.
Backpropagation algorithm
• Often it is reasonable not to update the weights immediately
after processing each instance, but to accumulate (sum
up) the necessary changes across a subset of training
instances (called an epoch) and only then update the
weights. This allows for faster convergence (Smith
1993).
• An epoch can be part of or the whole training set. After
the whole training set is processed (this sequence of
steps is called an iteration),
• the whole process is repeated again in an iterative
fashion - until the total error is acceptably low.
• The number of such iterations may sometimes be as high as
several thousand.
Backpropagation algorithm
(epoch-based, with cumulative updates)
1 – 6 as above.
7. Add the calculated weight updates {Δw_s} to the
accumulated total updates {ΔW_s}.
8. Repeat steps 2-7 for the instances comprising an
epoch.
9. Adjust the weights {w_s} of the network by the updates {ΔW_s}.
10. Repeat steps 2-9 until all instances in the training set
are processed. This constitutes one iteration.
11. Repeat the iteration of steps 2-10 until the error for
the entire system (the error E defined above or the error on a
cross-validation set) is acceptably low, or the pre-defined
number of iterations is reached.
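A compact sketch of the cumulative-update idea; dw below is a hypothetical helper standing for the per-instance updates {Δw_s} computed by steps 1-6, and the flat list of weights is an illustrative simplification:

def train_epoch(weights, epoch_data, dw):
    # dw(weights, x, d) -> per-instance updates {Δw_s} (hypothetical helper)
    total = [0.0] * len(weights)
    for x, d in epoch_data:              # steps 7-8: accumulate, do not apply yet
        for s, upd in enumerate(dw(weights, x, d)):
            total[s] += upd
    for s in range(len(weights)):        # step 9: one cumulative adjustment
        weights[s] += total[s]
    return weights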
Backpropagation
• In a single-layer network,
each neuron adjusts its weights according to
what output was expected of it, and the output it
gave. This can be mathematically expressed by
the Perceptron Delta Rule:
Δw = η (d − y) x
where w is the array of weights and
x is the array of inputs.
The Sigmoid (logistic) function
• One of the more popular alternative functions
used with back-propagation nets is the sigmoid
(logistic) function, g(x) = 1/(1 + e^(−x)).
The perceptron learning rule
w ← w + η (d − y) x
where w is the array of weights,
x is the array of inputs, and η is defined as the learning
rate.
y_i and d_i are the actual and desired outputs,
respectively.
• For the sigmoid, the deltas for the output layer are calculated as:
δ_i = y_i (1 − y_i) (d_i − y_i)
Calculate delta for the hidden layers
• We have to know the effect on the output of the
neuron if a weight is to change.
• Therefore, we need to know the derivative of the
error with respect to that weight.
• It has been proven that for neuron q in hidden
layer p, delta is:
δ_q^p = y_q^p (1 − y_q^p) ∑_j w_qj^(p+1) δ_j^(p+1)
Each delta value for a hidden layer requires that the delta
values for the layer after it be calculated first.
Backpropagation example
• N_INPUTS = 2, N_HIDDEN = 2, with a constant bias input of 1
feeding each layer (index k = 0).
• Hidden units:
v_i = g( ∑_{k=0}^{N_INS} w1(k,i) x_k ),  i = 1, 2
• Output unit:
Out = g( ∑_{k=0}^{N_HID} w2(k,1) v_k ),  with v_0 = 1
• The w1(k,i) are the input-to-hidden weights and the w2(k,1) are
the hidden-to-output weights.
Back-propagation algorithm
1. Initialize the weights of layer 1 and layer 2 to small random values.
2. Randomly choose an input pattern x^(μ).
3. Propagate the signal forward through the network:
x2(i) = g( ∑_{k=0,1,2} w1(k,i) x(k) )
Out(x) = g( ∑_{k=0,1,2} w2(k,1) x2(k) )
where g(x) = 1/(1 + e^(−x)).
4. Compute δ_i^L in the output layer (o_i = y_i^L):
δ_i^L = g′(h_i^L) [d_i^μ − y_i^L]
where h_i^l represents the net input to the ith unit in
the lth layer and g′ is the derivative of the activation
function g. For the logistic g this becomes
δ3(1) = x3(1) (1 − x3(1)) (d − x3(1))
5. Compute the deltas for the preceding
layers by propagating the errors backwards:
δ_i^l = g′(h_i^l) ∑_j w_ij^(l+1) δ_j^(l+1)
for l = (L−1), ..., 1.
6. Update the weights using
Δw_ji^l = η δ_i^l y_j^(l−1)
Taking η as 0.05:
Δw2(0,1) = η · x1(0) · δ2(1)
7. Go to step 2 and repeat for the next pattern
until the error in the output layer is acceptably
low, or a prespecified number of iterations is
reached.
• Run the entire process again on the next set of
training data.
• Slowly, as the training data is fed in and the
network is retrained a few thousand times, the
network settles to stable weight values.
APPLICATIONS
• To successfully work with real-world problems,
you must deal with numerous design issues,
including network model, network size,
activation function, learning parameters, and
number of training samples.
• Pattern classification
• Clustering
• Function approximation
• Prediction
• Optimization
• Content-addressable memory
• Control