An Introduction to Artificial
Neural Networks
Piotr Golabek, Ph.D.
Radom Technical University
Poland
[email protected]
An overview of the lecture
• What are ANN’s? What are they for?
• Neural networks as inductive machines –
inductive reasoning tradition
• The evolution of the concept – keywords,
structures, algorithms
An overview of the lecture
• Two general tasks: classification and
approximation
• Above tasks in more familiar setting –
decision making, signal processing, control
systems
• + live presentations
What are ANNs?
• Don’t ask me ...
• „ANN is a set of processing elements
(PE’s), influencing each other”
• (that definition suits almost everything...)
What are ANN’s
... but seriously...
• „neural” – following biological
(neurophysiological) inspiration,
• „artificial” – don’t forget these are not real
neurons!
• „networks” – strongly interconnected (in fact –
massive parallel processing)
and the implicit meaning:
• ANNs are „learning machines”, i.e. they adapt, just as
biological neurons do
Machine learning
• Important field of AI
• „A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by
P, improves with experience E”
(Take a look at „Machine Learning” by Tom Mitchell)
What is ANN?
• In case of ANNs, the „Experience” is input
data (examples)
• The ANN is an „inductive learning machine”,
i.e. a machine constructing internal
generalized concepts based on evidence
brought by the data stream
• ANN learns from examples – a paradigm
shift
What is ANN
• Structurally, ANN is a complex,
interconnected structure composed of
simple processing elements, often
mimicking biological neurons
• Functionally, ANN is an inductive learning
machine, it is able to undergo an adaptation
process (learning) driven by examples
What are ANNs used for?
• Recognition of images, OCR
• Recognition of time signal signatures – vibration
diagnostics, sonar signal interpretation, detection of
intrusion patterns in various transaction systems
• Trend prediction, esp. in financial markets (bond
rating prediction)
• Decision support, e.g. in credit assessment,
medical diagnosis
• Industrial process control, e.g. the melting
parameters in metallurgical processes
• Adaptive signal filtering to restore the information
from corrupted source
Inductive process
• Concepts rooted in epistemology (episteme –
knowledge)
• Heraclitus – „The nature likes to hide”
• Observations vs the true nature of the
phenomenon
• The empiric (experimental) method of developing
the model (hypothesis) of the true phenomenon –
the inductive process
• Something like this goes on during ANN learning
ANN as inductive learning
machine
• „The theory” – the way ANN behaves
• „Experimental data” – examples the ANN
learns from
• New examples cause the ANN to change its
behaviour, in order to better fit the
evidence brought by the examples
Inductive process
• Inductive bias - the „initial” theory (a priori
knowledge)
• Variance – the „evidence” brought by data
• A strong bias prevents the data from affecting the
theory
• A weak bias makes the theory vulnerable to
data corruption
• The game is to properly set the bias-variance
balance
ANN as inductive learning
machines
• We can shape the inductive bias of learning
process e.g. by tuning the number of
neurons
• The more neurons, the more flexible the
network (the more sensitive to data)
Inductive vs deductive reasoning
• Reasoning:
premises → conclusions
• Deductive reasoning – the conclusions are more
specific than premises (we just reason the
consequences)
• Inductive reasoning – the conclusions are more
general than premises (we reason the general rules
governing the phenomenon from the specific
examples)
The main goal of inductive
reasoning
• The main goal:
To achieve good generalization – to
infer a rule general enough that it fits
any future data
• This is also the main goal of machine
learning – to use the experience in order to
build good enough performance (in every
possible future situation)
McCulloch-Pitts model
Warren McCulloch
Walter Pitts
„A Logical Calculus Immanent in Nervous Activity”, 1943
McCulloch-Pitts model
Logical calculus approach:
• elementary logical operations: AND, OR, NOT
• basic reasoning operator, implication
p → q
(given premises p, we draw conclusion q)
McCulloch-Pitts model
• Logical operators are functions
Truth tables:

x  y | x AND y | x OR y | x → y
0  0 |    0    |   0    |   1
0  1 |    0    |   1    |   1
1  0 |    0    |   1    |   0
1  1 |    1    |   1    |   1

x | NOT x
0 |   1
1 |   0

x → y  ≡  (NOT x) OR y
McCulloch-Pitts model
• The working question: whether a neuron
can perform logical functions AND, OR,
NOT
• If the answer is yes, the chain of
implications (reasoning) could be
implemented in neural network
McCulloch-Pitts model
[Figure: neuron diagram. Inputs are multiplied by weights and summed into the total excitation; the activation function, with its activation threshold, turns the excitation into the neuron output (activation).]
McCulloch-Pitts transfer function
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge \theta\ ?$
Implementation of AND, OR,
NOT
• McCulloch-Pitts neuron
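A minimal Python sketch of how a threshold unit can realise the three logical operators. The weights and thresholds below are one hypothetical choice among many that work; they are not taken from the slides.

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: output 1 when the weighted sum reaches the threshold."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= threshold else 0

# Hypothetical weight/threshold choices (assumptions, many others work).
def AND(x, y):
    return mp_neuron([x, y], [1, 1], 2)

def OR(x, y):
    return mp_neuron([x, y], [1, 1], 1)

def NOT(x):
    return mp_neuron([x], [-1], 0)

def IMPLIES(x, y):
    # x -> y is equivalent to (NOT x) OR y, so a small network of units suffices.
    return OR(NOT(x), y)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, AND(x, y), OR(x, y), IMPLIES(x, y))
```

Chaining units, as in IMPLIES, is what makes a chain of implications implementable as a network.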
Including threshold into weights
McCulloch-Pitts model
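• Neuron equations
$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
$a = f(z)$
$z = \sum_{i=1}^{n} w_i x_i$
$z = \mathbf{w}^T \mathbf{x}$ (vector dot product), with $\mathbf{w}^T = [w_1, w_2, \dots, w_{n-1}, w_n]$ and $\mathbf{x} = [x_1, x_2, \dots, x_{n-1}, x_n]^T$
$\mathbf{x} \cdot \mathbf{w} = \|\mathbf{x}\| \, \|\mathbf{w}\| \cos\alpha$
[Figure: three relative positions of x and w. Same direction: max similarity. Opposite directions: max „antisimilarity”. Perpendicular: max dissimilarity (orthogonality).]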
Vector dot product interpretation
• Inputs are called „input vector”
• weights are called „weight vector”
• Neuron excites, when input vector is similar
enough to the weight vector
• Weight vector is a „template” for some set of
input vectors
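The „template” interpretation can be checked numerically. The sketch below computes the cosine of the angle between an input vector and a weight vector; the vectors are made-up illustrations.

```python
import math

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

def cosine(x, w):
    """cos(alpha) between x and w, from x.w = |x| |w| cos(alpha)."""
    return dot(x, w) / (math.sqrt(dot(x, x)) * math.sqrt(dot(w, w)))

w = [1.0, 2.0]                       # hypothetical weight vector (the template)
similar = cosine([2.0, 4.0], w)      # same direction as w
opposite = cosine([-1.0, -2.0], w)   # opposite direction
orthogonal = cosine([2.0, -1.0], w)  # perpendicular to w
print(similar, opposite, orthogonal)
```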
Neurons – elements of the ANNs
Don’t be fooled...
These are our neurons ...
Neurons – elements of the ANNs
Single neuron (stereoscopic)
Neurons – elements of
neural networks
There is some analogy...
The real neuron
Synaptic connection – organic structure
The real neuron
Synaptic connection – the molecular level
McCulloch-Pitts model
• The conclusion:
If we tune the weights of the neuron properly,
we can make it implement the transfer
function we need (AND, OR, NOT)
• The question:
How are the weights of neurons tuned in our
brains? What is the adaptation mechanism?
Neuron adaptation
• Donald Hebb (1949, neurophysiologist):
“When an axon of cell A is near enough to
excite a cell B and repeatedly or
persistently takes part in firing it, some
growth process or metabolic change takes
place in one or both cells such that A’s
efficiency, as one of the cells firing B, is
increased.”
Hebb rule
$\Delta w_{ij} = \eta \, x_j \, y_i$
Hebb rule
• It is a local rule of adaptation
• The multiplication of input and output signifies a
correlation between them
• The rule is unstable – a weight can grow without
limits
(that doesn’t happen in nature, where there are
limited resources)
• numerous modifications of the Hebb rule have been
proposed to make it stable
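The instability is easy to reproduce. The sketch below assumes the simplest possible setting, which is not spelled out in the slides: a linear neuron with a single weight (y = w·x), a constant input, and a hypothetical learning rate.

```python
eta = 0.1   # learning rate (assumed value)
w = 0.5     # initial weight (assumed value)
x = 1.0     # constant input

history = []
for step in range(50):
    y = w * x            # output of a linear neuron
    w += eta * x * y     # Hebb rule: delta w = eta * x * y
    history.append(w)

print(history[0], history[-1])
```

Each step multiplies the weight by (1 + eta·x²), so it grows geometrically and never settles.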
Hebb rule
• Hebb rule is very important and useful ...
• ... but for now we want to make the neuron
learn the function we need
Rosenblatt Perceptron
• Frank Rosenblatt (1958) – Perceptron – hardware
(electromechanical) implementation of the ANN
(effectively – 1 neuron).
Rosenblatt Perceptron
• One of the goals of the experiment was to train the
neuron, i.e. to make it go active whenever a specific
pattern appears on the „retina”
• The neuron was to be trained with examples
• The experimenter („teacher”) was to expose the
neuron to the different patterns and in each case
tell it, whether it should fire, or not
• The learning algorithm should do its best to make
the neuron do what the teacher requires
Perceptron learning rule
A kind of Hebbian rule modification (the weight
correction depends on the error between the
actual and the desired output)
$\Delta w_k = \eta \, x_k \, y$ (Hebb rule)
$\Delta w_k = \eta \, x_k \, (d - y)$ (perceptron rule)
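A small sketch of the rule in action, training one threshold neuron on the AND function. The assumptions are mine: the threshold is folded into the weights as an extra constant input of 1, the learning rate is 1, and training starts from zero weights.

```python
def step(z):
    return 1 if z >= 0 else 0

def predict(w, x):
    return step(sum(wi * xi for wi, xi in zip(w, x)))

def train_perceptron(samples, eta=1.0, epochs=20):
    w = [0.0, 0.0, 0.0]                  # two input weights plus the threshold weight
    for _ in range(epochs):
        for x, d in samples:
            xb = list(x) + [1.0]         # constant input carries the threshold
            y = predict(w, xb)
            # Perceptron rule: delta w_k = eta * x_k * (d - y)
            w = [wi + eta * (d - y) * xi for wi, xi in zip(w, xb)]
    return w

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_samples)
outputs = [predict(w, list(x) + [1.0]) for x, _ in and_samples]
print(w, outputs)
```

The weights change only when the neuron's answer differs from the teacher's, so training stops by itself once all four patterns are answered correctly.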
Supervised scheme
Supervised scheme
• One training example – the pair <input
value, desired output> is called a training
pair
• The set of all the training pairs is called
„training set”
Unsupervised scheme
Example of supervised learning
• Linear Associator
Neural networks
• „A set of processing elements influencing
each other”
• The neurons (PEs) are interconnected. The
output of each neuron can be connected to
the input of every neuron, including itself
Neural networks
• If there is a path of propagation (direct or
indirect) between the output of a neuron and
its own input, we have feedbacks - such
structures are called „recurrent”
• If there is no feedback in a network, such
structure is called „feedforward”
What does recurrent mean?
• A recurrent definition is a definition of a
concept that uses the very same
concept (but perhaps in a lower-complexity
setup)
• A recurrent function is a function calling itself
• The classical recurrent definition is the factorial
function:
$n! = n \cdot (n-1)!, \quad 0! = 1$
Recurrent connection
• „function calling itself”
y=f(z)
$y(t) = f(z(t)) = f(h(y(t-1), \dots)) = f(h(f(z(t-1)), \dots))$
Recurrent connection
• At any given moment, the whole history of
past excitations influences neuron output
• The concept of temporal memory emerges
• The past influences present to the degree
determined by the weight of the recurrent
connection
• This weight is effectively a „forgetting
factor”
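The „forgetting factor” can be seen in a stripped-down sketch. It assumes a single neuron with an identity activation whose output is fed back through a recurrent weight r; both the simplification and the values of r are illustrative assumptions.

```python
def run_recurrent(inputs, r):
    """y(t) = x(t) + r * y(t-1): the recurrent weight r scales the remembered past."""
    y = 0.0
    outputs = []
    for x in inputs:
        y = x + r * y
        outputs.append(y)
    return outputs

impulse = [1.0] + [0.0] * 5          # one excitation, then silence
fast = run_recurrent(impulse, 0.2)   # small weight: the past is forgotten quickly
slow = run_recurrent(impulse, 0.9)   # large weight: the past lingers
print(fast)
print(slow)
```

After a single impulse the output at step t equals r to the power t, so the size of r directly sets how long the excitation is remembered.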
Feedforward layered network
Our brain
• There are ca. $10^{11}$ neurons in our brain
• Each of them is connected on average to 1000
other neurons
• Each neuron is thus connected to only about one in $10^{8}$
of the other neurons
• If every neuron were connected to every other one,
our brain would have to be a few hundred meters
in diameter
• There is a strong modularity
Our brain
A fragment of the neural network connecting the retina to the
visual perception area of the brain
Our brain vs computers
• The memory size estimation: ca. $10^{14}$
connections give an estimated size of 100 TB (each
connection has a continuous real weight)
• Neurons are quite slow, capable of activating no
more than 200 times per second, but there are a lot
of them, which gives an estimate of $10^{16}$ floating
point operations per second.
Neural networks vs computer
Neural networks:
• Many ($10^{11}$) simple processing elements (neurons)
• Massively parallel, distributed processing
• Memory evenly distributed over the whole structure, content-addressable
• High fault tolerance
Computers:
• A few complex processing elements
• Sequential, centralized processing
• Compact memory, addressed by an index
• High vulnerability to faults
How to train the whole network?
• For the Perceptron – the output of the neuron
could be compared to the desired value
• But what about the layered structure? How to reach
the hidden neurons?
• The original idea comes from experiments of
Widrow and Hoff in the 1960s
• The global error optimization using gradient
descent has been used
Supervised scheme once again
Error minimization
• The error function component can be quite
elaborately defined
• But the goal is always to minimize the error
• One widely used technique of function
optimization (minimization/maximization)
is called gradient descent
Error function
• One cycle of training consists of the
presentation of many training pairs – it is
called one epoch of learning
• The error accumulated for the whole epoch
is an average:
$E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2, \quad y_i = F(\mathbf{x}_i; \mathbf{w})$
Why quadratic function?
Error function once again
$E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2, \quad y_i = F(\mathbf{x}_i; \mathbf{w})$
• As subsequent input/output pairs are
„averaged out”, we can think of the error
function mainly as a function of weights w
• The goal of learning is to choose the weights in
such a way that the error is minimized
Error function derivative
• The derivative gives us information on whether the function increases or decreases when the argument increases (and how fast)
• We want to minimize the function value, thus we have to act „against” the sign of the derivative
• Where the function is falling, the sign of the derivative is negative, so we have to increase the argument
[Figure: error plotted against a single weight $w_i$]
The gradient rule
$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
Error function gradient
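• In the multidimensional case we deal
with a vector of partial derivatives of the error function
with respect to each dimension
(the gradient):
$\mathbf{g} = \left[ \frac{\partial E}{\partial x_1}, \frac{\partial E}{\partial x_2}, \dots, \frac{\partial E}{\partial x_n} \right]$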
Gradient method
[Figure: error surface E over the weights $w_1$ and $w_2$]
The method of moving „against” the gradient is
commonly called „hill climbing”
Gradient method
Steepest descent demo
• MATLAB demonstration
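For readers without MATLAB, here is a rough Python sketch of steepest descent on a made-up two-weight error surface. The error function, the starting point and the learning rate are all illustrative assumptions, not the content of the original demonstration.

```python
def error(w):
    # Hypothetical bowl-shaped error surface with its minimum at (1, -2).
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def gradient(w):
    # Partial derivatives of the error with respect to each weight.
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)]

eta = 0.1            # learning rate (assumed value)
w = [4.0, 3.0]       # starting point (assumed value)
errors = [error(w)]
for _ in range(100):
    g = gradient(w)
    # Gradient rule: move every weight against its partial derivative.
    w = [wi - eta * gi for wi, gi in zip(w, g)]
    errors.append(error(w))

print(w, errors[0], errors[-1])
```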
Other form of activation function
• So-called sigmoidal function, e.g.:
$f(z) = \frac{1}{1 + e^{-\beta z}}$
Other form of activation function
[Figure: sigmoid curves for β=1, β=100 and β=0.4]
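Backpropagation algorithm
$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$
[Figure: layered network with a hidden-layer weight marked „Δwij?”]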
Backpropagation algorithm
[Figure: layered network annotated with the error gradients with respect to the output y, the excitations z and the weights w of layers II and III]
Chain rule
• Applies the chain rule of differentiation:
$y = g(w)$
$\frac{\partial f}{\partial w} = \frac{\partial f}{\partial y} \frac{\partial y}{\partial w}$
That makes it possible to „transfer” the error
backward toward the hidden units
Chain rule
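Backward propagation through neuron:
[Figure: $\partial E/\partial a$ at the neuron output is propagated back through $a = f(z)$ to give $\partial E/\partial z$]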
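Backpropagation through neuron
$a = f(z) = \frac{1}{1 + e^{-z}}$
$\frac{\partial E}{\partial z} = \frac{\partial E}{\partial a} \frac{\partial a}{\partial z}$
Backpropagation through neuron
$\frac{\partial a}{\partial z} = f'(z) = -\frac{1}{(1 + e^{-z})^2} \cdot e^{-z} \cdot (-1) = \frac{e^{-z}}{(1 + e^{-z})^2}$
$= \frac{1}{1 + e^{-z}} \cdot \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right)$
$= f(z) (1 - f(z)) = a (1 - a)$
Backpropagation through neuron
$\frac{\partial E}{\partial z} = \frac{\partial E}{\partial a} \frac{\partial a}{\partial z} = \frac{\partial E}{\partial a} a (1 - a)$
Backpropagation through neuron
[Figure: $\partial E/\partial z$ is propagated back to the weights, e.g. $\partial E/\partial w_2$]
Backpropagation through neuron
$z = f(w_i) = \sum_{k=1}^{m} w_k x_k = w_1 x_1 + w_2 x_2 + \dots + w_i x_i + \dots + w_m x_m$
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial z} \frac{\partial z}{\partial w_i}$
Backpropagation through neuron
$\frac{\partial z}{\partial w_i} = \frac{\partial (w_1 x_1 + w_2 x_2 + \dots + w_i x_i + \dots + w_m x_m)}{\partial w_i} = x_i$
Backpropagation through neuron
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial z} \frac{\partial z}{\partial w_i} = \frac{\partial E}{\partial z} x_i$
Backpropagation through neuron
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial z} x_i = \frac{\partial E}{\partial a} \frac{\partial a}{\partial z} x_i = \frac{\partial E}{\partial a} a (1 - a) x_i$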
Backpropagation through neuron
• Conclusion: if we know the error function
gradient with respect to the output of the
neuron, we can compute the gradient with
respect to each of its weights
• In general, our goal is to propagate the error
function gradient from the output of the
network to the outputs of the hidden units
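The conclusion for a single sigmoid neuron can be verified numerically. The sketch below assumes a squared error E = (a − d)² for one training pair and made-up weights and inputs, then compares the analytic gradient with finite differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 2.0]       # inputs (assumed values)
d = 1.0              # desired output (assumed value)
w = [0.5, -0.3]      # weights (assumed values)

def error(weights):
    a = sigmoid(sum(wi * xi for wi, xi in zip(weights, x)))
    return (a - d) ** 2

def analytic_grad(weights):
    a = sigmoid(sum(wi * xi for wi, xi in zip(weights, x)))
    dE_da = 2.0 * (a - d)            # gradient at the neuron output
    dE_dz = dE_da * a * (1.0 - a)    # back through the sigmoid
    return [dE_dz * xi for xi in x]  # back to each weight

def numeric_grad(weights, h=1e-6):
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += h
        down[i] -= h
        grads.append((error(up) - error(down)) / (2.0 * h))
    return grads

print(analytic_grad(w))
print(numeric_grad(w))
```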
Backpropagation
• Additional problem: in general, each hidden
neuron is connected to more than one
neuron of the next layer
• There are many paths for the error gradient
to be transmitted backward from the next
layer
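Error backpropagation
$\frac{\partial E}{\partial w_{ij}}$ is a „sensitivity” of E to the change of $w_{ij}$
[Figure: the gradients $\partial E/\partial z_1^{II}$ and $\partial E/\partial z_2^{II}$ arriving from the next layer]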
Backpropagation through layer
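• Applying the rule of differentiation for a function
of composite arguments:
$\frac{\partial f(r(x), s(x))}{\partial x} = \frac{\partial f}{\partial r} \frac{\partial r(x)}{\partial x} + \frac{\partial f}{\partial s} \frac{\partial s(x)}{\partial x}$
we can propagate the error gradient through
the layer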
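Backpropagation through layer
[Figure: the activation $a_j$ feeds two excitations of the next layer, $z_1(a_j)$ and $z_2(a_j)$]
$\frac{\partial E(z_1(a_j), z_2(a_j))}{\partial a_j} = \frac{\partial E}{\partial z_1} \frac{\partial z_1}{\partial a_j} + \frac{\partial E}{\partial z_2} \frac{\partial z_2}{\partial a_j}$
Backpropagation through layer
[Figure: activations $a_1$, $a_2$, $a_3$ reach the excitation $z_1$ through the weights $w_{11}$, $w_{12}$, $w_{13}$]
$z_1 = w_{11} a_1 + w_{12} a_2 + w_{13} a_3$
Backpropagation through layer
$z_1 = w_{11} a_1 + w_{12} a_2 + w_{13} a_3$
$\frac{\partial z_1}{\partial a_2} = w_{12}$
More generally:
$z_i = w_{i1} a_1 + w_{i2} a_2 + \dots + w_{ij} a_j + \dots + w_{im} a_m$
$\frac{\partial z_i}{\partial a_j} = w_{ij}$
Backpropagation through layer
$\frac{\partial E(z_1, z_2, \dots, z_i, \dots, z_n)}{\partial a_j} = \frac{\partial E}{\partial z_1} w_{1j} + \frac{\partial E}{\partial z_2} w_{2j} + \dots + \frac{\partial E}{\partial z_i} w_{ij} + \dots + \frac{\partial E}{\partial z_n} w_{nj} = \sum_{k=1}^{n} \frac{\partial E}{\partial z_k} w_{kj}$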
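Forward propagation
[Figure: activations $a_1$, $a_2$, $a_3$ reach the excitation $z_1$ through the weights $w_{11}$, $w_{12}$, $w_{13}$]
$z_i = \sum_{k=1}^{n} a_k w_{ik}$
The activations of the neurons are propagated
Forward propagation
[Figure: the excitation $z_1$ is turned into the activation $a_1^{II}$ of the next layer]
$a_1^{II} = f(z_1) = \frac{1}{1 + e^{-z_1}}$
The activations of the neurons are propagated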
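Backpropagation
[Figure: the gradients $\partial E/\partial z_1$ and $\partial E/\partial z_2$ flow back to $a_2$ through the weights $w_{12}$ and $w_{22}$]
$\frac{\partial E}{\partial a_j} = \sum_{k=1}^{n} \frac{\partial E}{\partial z_k} w_{kj}$
The error function gradient is propagated
Backpropagation
[Figure: the gradients $\partial E/\partial a_1^{II}$ and $\partial E/\partial a_2^{II}$ are turned into $\partial E/\partial z_1$ and $\partial E/\partial z_2$]
$\frac{\partial E}{\partial z_i} = \frac{\partial E}{\partial a_i^{II}} \frac{\partial a_i^{II}}{\partial z_i} = \frac{\partial E}{\partial a_i^{II}} a_i^{II} (1 - a_i^{II})$
The error function gradient is propagated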

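Single algorithm cycle
$E = \frac{1}{N} \sum_{p=1}^{N} (a_p^{III} - d_p)^2$
$\frac{\partial E}{\partial a_p^{III}} = \frac{2}{N} (a_p^{III} - d_p)$
$\frac{\partial E}{\partial z^{III}} = \frac{\partial E}{\partial a^{III}} a^{III} (1 - a^{III})$
$\frac{\partial E}{\partial w_{11}^{III}} = \frac{\partial E}{\partial z^{III}} a_1^{II}$
$\frac{\partial E}{\partial a_2^{II}} = \frac{\partial E}{\partial z^{III}} w_{12}^{III}$
One complete cycle of the algorithm is finished (situation equivalent to the initial)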
Forward propagation
• One cycle of algorithm:
– get inputs of the current layer
– compute the excitations of the considered layer,
„transferring” inputs through the layer of weights
(multiplying the inputs by the corresponding weights
and performing the summation)
– calculate the activations of the layer’s neurons by
transferring the neuron excitations through the
activation functions
• Repeat that cycle, starting with the layer 1 on to
the output layer. The activations of neurons of the
output layer are the outputs of the network
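The cycle above can be sketched in a few lines of Python. The network below (two inputs, two hidden neurons, one output, no thresholds) and its weights are hypothetical; `layers[l][i][k]` is assumed to hold the weight from input k to neuron i of layer l.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, x):
    a = list(x)                                   # inputs of the current layer
    for W in layers:
        # Excitations: multiply inputs by the weights and sum.
        z = [sum(w_ik * a_k for w_ik, a_k in zip(row, a)) for row in W]
        # Activations: pass the excitations through the activation function.
        a = [sigmoid(z_i) for z_i in z]
    return a                                      # activations of the output layer

layers = [
    [[0.5, -0.4], [0.3, 0.8]],   # layer 1: two neurons, two inputs each (assumed weights)
    [[1.0, -1.0]],               # output layer: one neuron (assumed weights)
]
output = forward(layers, [1.0, 0.0])
print(output)
```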
Backpropagation
• One cycle of the algorithm:
– get error function gradients with respect to the outputs
of the layer
– compute the error gradients with respect to the
excitations of the layer’s neurons by transferring the
gradients backward through the derivatives of the
neuron activation functions
– compute the error function gradients with respect to the
outputs of the prior layer by transferring the so far
computed gradients through the layer of weights
(multiplying the gradients by the corresponding weights
and performing the summation)
Backpropagation
• Repeat that cycle starting from the last layer
– the error function gradients can be
computed directly – on toward the first
layer. The gradients computed through the
process can be used to calculate gradients
with respect to the weights
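A compact sketch of the backward cycle for a small sigmoid network, under assumptions of my own: one training pair, squared error E = Σ(a − d)², no thresholds, and made-up weights. The result is compared against finite differences as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, x):
    """Return the activations of every layer, starting with the inputs."""
    activations = [list(x)]
    for W in layers:
        z = [sum(w * a for w, a in zip(row, activations[-1])) for row in W]
        activations.append([sigmoid(v) for v in z])
    return activations

def backward(layers, activations, d):
    # Gradient with respect to the network outputs, computed directly.
    dE_da = [2.0 * (a - t) for a, t in zip(activations[-1], d)]
    grads = [None] * len(layers)
    for l in reversed(range(len(layers))):
        a_out, a_in = activations[l + 1], activations[l]
        # Back through the activation functions: dE/dz = dE/da * a * (1 - a).
        dE_dz = [g * a * (1.0 - a) for g, a in zip(dE_da, a_out)]
        # Gradient with respect to the weights: dE/dw_ik = dE/dz_i * a_k.
        grads[l] = [[dz * a_k for a_k in a_in] for dz in dE_dz]
        # Back through the layer of weights: dE/da_j = sum_i dE/dz_i * w_ij.
        dE_da = [sum(dE_dz[i] * layers[l][i][j] for i in range(len(dE_dz)))
                 for j in range(len(a_in))]
    return grads

def error(layers, x, d):
    out = forward(layers, x)[-1]
    return sum((a - t) ** 2 for a, t in zip(out, d))

def numeric_grad(layers, x, d, l, i, j, h=1e-6):
    original = layers[l][i][j]
    layers[l][i][j] = original + h
    e_plus = error(layers, x, d)
    layers[l][i][j] = original - h
    e_minus = error(layers, x, d)
    layers[l][i][j] = original
    return (e_plus - e_minus) / (2.0 * h)

layers = [[[0.5, -0.4], [0.3, 0.8]], [[1.0, -1.0]]]   # assumed weights
x, d = [1.0, 0.5], [1.0]                              # assumed training pair
grads = backward(layers, forward(layers, x), d)
print(grads)
```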
BP Algorithm
• It all ends up with a computationally efficient
and elegant procedure to compute the partial
derivative of the error function with respect to
every weight in a network.
• It allows us to correct every weight of a network
in such a way as to reduce the error
• Repeating the process on and on gradually reduces
the error and constitutes the learning process
Example source code (MATLAB)
Learning rate
• Term η is called „learning rate”
$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
The faster, the better, but too high a learning rate can make
the learning process unstable
Learning rate
• In practice we have to adjust the
learning rate during the course of the learning
process
• A constant learning rate is
not a very good strategy
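The effect of the learning rate can be illustrated on a toy problem. The sketch assumes a one-weight error function E(w) = (w − 3)², chosen only for illustration, and runs the gradient rule with two different values of η.

```python
def descend(eta, steps=50, w=0.0):
    """Run the gradient rule on E(w) = (w - 3)^2 and return the final distance from the minimum."""
    for _ in range(steps):
        dE_dw = 2.0 * (w - 3.0)
        w -= eta * dE_dw
    return abs(w - 3.0)

stable = descend(0.1)     # each step shrinks the distance by a factor of 0.8
unstable = descend(1.1)   # each step overshoots and grows the distance by a factor of 1.2
print(stable, unstable)
```

For this particular function any η above 1 makes every step overshoot the minimum by more than the previous distance, so the process diverges.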
Two types of problems
• Data grouping/classification
• Function approximation
Classification
[Figure: input patterns]
Classification
[Figure: the classification system labels an input pattern „7!”]
Classification
Alternative scheme:
[Figure: the classification system assigns a confidence to every class]
0 (1%)
1 (1%)
2 (1%)
3 (1%)
4 (1%)
5 (1%)
6 (1%)
7 (90%)
8 (1%)
9 (1%)
No decision
Classification – typical
applications
Classification == Pattern recognition:
• medical diagnosis
• fault condition recognition
• handwriting recognition
• object identification
• decision support
Classification example
Applet: Character recognition
Classification
• Assumes that a class is a group of similar
objects
• Similarity has to be defined
• Similar objects – objects having similar
attributes
• We have to describe the attributes
Classification
• E.g. some of the human attributes:
– Height
– Age
Class K: „Tall people under 30”
Classification
Object O1 belonging to the class K:
„A person 180 cm tall, 23 years old”
(180, 23)
Object O2 that doesn’t belong to the class K:
„A person 165 cm tall, 35 years old”
(165, 35)
Classification
[Figure: O1 (180, 23) and O2 (165, 35) plotted in the HEIGHT-AGE plane]
The similarity of objects
[Figure: the same two objects in the HEIGHT-AGE plane]
The similarity
• Euclidean distance (Euclidean metric):
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Other metrics
• Manhattan metric
Classification
• The more attributes – the more dimensions:
[Figure: three-dimensional attribute space with axes X1, X2, X3]
Multidimensional metric
$d = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + (x_{13} - x_{23})^2}$
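Both metrics are short to write down in Python. The sketch below works for any number of attributes and uses the two people from the earlier slide, (height, age), as sample objects; the three-dimensional points are made up.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

o1 = (180, 23)   # object O1: height, age
o2 = (165, 35)   # object O2: height, age
d_euclid = euclidean(o1, o2)
d_manhattan = manhattan(o1, o2)
d_3d = euclidean((1, 2, 3), (4, 6, 3))   # hypothetical three-attribute objects
print(d_euclid, d_manhattan, d_3d)
```

Note that height and age are measured in different units, so in practice the attributes would usually be rescaled before distances are compared.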
Multidimensional data
• OLIVE presentation
Classification
[Figure: objects described by many attributes: Atr 1, Atr 2, ..., Atr 8, etc.]
Classification
Y=K*X
• Drawing a boundary between two groups
AGE > K*HEIGHT
AGE = K*HEIGHT
AGE < K*HEIGHT
[Figure: the line AGE = K*HEIGHT in the HEIGHT-AGE plane, with ages 23 and 35 marked]
Classification
AGE = K*HEIGHT+B
AGE-K*HEIGHT-B=0
AGE+K2*HEIGHT+B2=0
[Figure: separating lines in the HEIGHT-AGE plane, with ages 23 and 35 marked]
Classification
• In general, for the multidimensional case, the so-called
classification hyperplane is described
by:
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = 0$
• We are very close to the McCulloch-Pitts ...
McCulloch-Pitts
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge \theta\ ?$
Neuron as a simple classifier
• A single McCulloch-Pitts threshold unit
performs a linear dichotomy (separation of
two classes in the multidimensional space)
• Tuning the weights and threshold changes
the orientation and position of the separating hyperplane
Neuron as a simple classifier
• If we tune the weights properly (train the
neuron properly), it will classify the
processed objects
• Processing an object means – exposing the
object attributes on the neuron inputs
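As an illustration, the sketch below exposes the two people from the earlier slide to a single threshold unit. The weights and bias are hand-picked hypothetical values, not trained ones; a single line like this can only roughly approximate a class such as „tall people under 30”.

```python
def classify(height, age, w_height=1.0, w_age=-1.0, b=-140.0):
    """Threshold unit: 1 when w_height*height + w_age*age + b >= 0, otherwise 0."""
    z = w_height * height + w_age * age + b
    return 1 if z >= 0 else 0

in_class = classify(180, 23)      # object O1: 180 - 23 - 140 = 17
out_of_class = classify(165, 35)  # object O2: 165 - 35 - 140 = -10
print(in_class, out_of_class)
```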
More classes
• More neurons – a network
• Every neuron performs a bisection of the feature space
• A few neurons partition the space into a few distinct
areas
Sigmoidal activation function
[Figure: HEIGHT-AGE plane, with ages 23 and 35 marked]
Classification example
• NeuroSolutions – Principal Component
Complicated separation border
• NeuroSolutions – Support Vector Machine
Approximation
[Figure: X → BLACK BOX (?) → Y]
$Y = F(X)$
Example
• True phenomenon
Example
• There is only a limited number of
observations:
Example
• And the observations are corrupted:
Typical situation
• We have a small amount of data
• Data is corrupted (we are not certain of how
reliable it is)
Example
• The experimenter sees only the data:
Experimenter/system task
• „To fill the gaps”?
• We would call that „an interpolation”
• But what we truly think of is an
approximation: looking for a model
(„trace”), which is most similar
(approximate) to the unknown (!) true
phenomenon
Example
• We can apply e.g. a MATLAB polyfit:
Polyfit
• Polynomial approximation
$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$
Example
• Polyfit with 2nd order polynomial:
Example
• But how do we know that we should apply
the 2nd order polynomial?
Example
• And what if we apply the 15th degree? It fits the data
much better (but it doesn’t fit the original well):
The variance factor
• The higher the degree, the more „freaky” it gets
• The 15th degree is quite flexible, so it can be fit to many
things
• However, the generalization is sacrificed: the
model fits the data well, but most probably would
fail on other data that would come later
• That comes too close to modelling the
variance of the data
Example
• We could also insist on the 1st order
Example
• ... or even, the 0th order (the data are almost
completely ignored)...
The bias factor
• Lower polynomial degree means lower
flexibility
• An arbitrary choice of the model degree is what we
called an inductive bias
• It is a kind of a priori knowledge that we
introduce
• In case of 0th and 1st order the bias is too
strong
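The bias and variance effects can be reproduced with NumPy's `numpy.polyfit`, used here as a stand-in for the MATLAB polyfit mentioned in the slides. The „true phenomenon”, the noise level and the number of observations are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_phenomenon(x):
    # Hypothetical 2nd order phenomenon, unknown to the experimenter.
    return 1.0 - 2.0 * x + 3.0 * x ** 2

x_train = np.linspace(-1.0, 1.0, 20)                    # a small number of observations
d_train = true_phenomenon(x_train) + rng.normal(0.0, 0.3, x_train.size)  # corrupted data
x_dense = np.linspace(-1.0, 1.0, 200)

train_mse = {}
true_mse = {}
for degree in (0, 1, 2, 15):
    coeffs = np.polyfit(x_train, d_train, degree)
    # Error on the data the experimenter sees.
    train_mse[degree] = float(np.mean((np.polyval(coeffs, x_train) - d_train) ** 2))
    # Error against the true phenomenon (normally unknown).
    true_mse[degree] = float(np.mean((np.polyval(coeffs, x_dense) - true_phenomenon(x_dense)) ** 2))
    print(degree, train_mse[degree], true_mse[degree])
```

The error on the training data can only go down as the degree grows; comparing it with the error against the true phenomenon shows whether the extra flexibility is modelling the phenomenon or the noise.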
Polyfit
A polynomial:
$y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$
Training set:
$(x_1, d_1), (x_2, d_2), \dots, (x_p, d_p)$
Polyfit:
$d_1 = a_0 + a_1 x_1 + a_2 x_1^2 + \dots + a_n x_1^n$
$d_2 = a_0 + a_1 x_2 + a_2 x_2^2 + \dots + a_n x_2^n$
$\vdots$
$d_p = a_0 + a_1 x_p + a_2 x_p^2 + \dots + a_n x_p^n$
Approximation
• Linear model:
• A model employing polynomials (linear as
well):
Approximation
• Generalized linear model:
Approximation
• The hk() functions can be various:
polynomials, sines, ...
• They can be sigmoids as well
Approximation
• ANN can do a linear model...
[Figure: the functions h1() and h2() feed one output through the weights w11 and w21]
$w_{11} h_1() + w_{21} h_2()$
Approximation
• But can do much more!
ANN transfer function
• This looks like a nonlinear function, indeed ...
$f(\mathbf{W}, x) = \frac{1}{1 + \exp\left( -w^{III} \frac{1}{1 + \exp\left( -w^{II} \frac{1}{1 + \exp(-w^{I} x)} \right)} \right)}$
Approximation
• An Artificial Neural Network built of processing
elements with sigmoidal activation functions is a
universal approximator for functions of class
C1 (continuous up to the first derivative) – Hornik,
1989
• Every typical transfer function can be modelled
with arbitrary precision, provided there is a
sufficient number of neurons
Function approximation example
• Java applet – function approximation
Where to go now?
• This set of slides:
– http://pr.radom.net/~pgolabek/Antwerp/NNIntro.ppt
• Be sure to check the comp.ai.neural-nets FAQ:
– http://www.faqs.org/faqs/ai-faq/neural-nets/
• Books:
– Simon Haykin: „Neural networks – a comprehensive
foundation”
– Christopher Bishop: „Neural networks for pattern
recognition”
– „Neural and adaptive systems” – the NeuroSolutions
interactive book (www.nd.com)
Where to go now
• Software
– NeuroSolutions – www.nd.com
– MATLAB Neural Network Toolbox
– SNNS - Stuttgart Neural Network Simulator
– and countless others