An Introduction to Artificial Neural Networks
Piotr Golabek, Ph.D.
Radom Technical University, Poland
[email protected]

An overview of the lecture
• What are ANNs? What are they for?
• Neural networks as inductive machines – the inductive reasoning tradition
• The evolution of the concept – keywords, structures, algorithms

An overview of the lecture
• Two general tasks: classification and approximation
• The above tasks in a more familiar setting – decision making, signal processing, control systems
• + live presentations

What are ANNs?
• Don't ask me ...
• "An ANN is a set of processing elements (PEs) influencing each other"
• (that definition suits almost everything...)

What are ANNs ... but seriously...
• "neural" – following biological (neurophysiological) inspiration,
• "artificial" – don't forget these are not real neurons!
• "networks" – strongly interconnected (in fact – massively parallel processing)
and the implicit meaning:
• ANNs are "learning machines", i.e. they adapt, just as biological neurons do

Machine learning
• An important field of AI
• "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" (take a look at "Machine Learning" by Tom Mitchell)

What is an ANN?
• In the case of ANNs, the "experience" is the input data (examples)
• An ANN is an "inductive learning machine", i.e. a machine that constructs internal generalized concepts based on the evidence brought by a data stream
• An ANN learns from examples – a paradigm shift

What is an ANN?
• Structurally, an ANN is a complex, interconnected structure composed of simple processing elements, often mimicking biological neurons
• Functionally, an ANN is an inductive learning machine: it is able to undergo an adaptation process (learning) driven by examples

What are ANNs used for?
• Recognition of images, OCR
• Recognition of time-signal signatures – vibration diagnostics, sonar signal interpretation, detection of intrusion patterns in various transaction systems
• Trend prediction, esp. in financial markets (bond rating prediction)
• Decision support, e.g. in credit assessment, medical diagnosis
• Industrial process control, e.g. the melting parameters in metallurgical processes
• Adaptive signal filtering to restore information from a corrupted source

Inductive process
• Concepts rooted in epistemology (episteme – knowledge)
• Heraclitus – "Nature likes to hide"
• Observations vs. the true nature of the phenomenon
• The empirical (experimental) method of developing a model (hypothesis) of the true phenomenon – the inductive process
• Something like this goes on during ANN learning

ANN as an inductive learning machine
• "The theory" – the way the ANN behaves
• "Experimental data" – the examples the ANN learns from
• New examples cause the ANN to change its behaviour in order to fit the evidence brought by the examples better

Inductive process
• Inductive bias – the "initial" theory (a priori knowledge)
• Variance – the "evidence" brought by the data
• A strong bias prevents the data from affecting the theory
• A weak bias makes the theory vulnerable to data corruption
• The game is to set the bias-variance balance properly

ANN as an inductive learning machine
• We can shape the inductive bias of the learning process, e.g.
by tuning the number of neurons
• The more neurons, the more flexible the network (the more sensitive to data)

Inductive vs deductive reasoning
• Reasoning: premises → conclusions
• Deductive reasoning – the conclusions are more specific than the premises (we just derive the consequences)
• Inductive reasoning – the conclusions are more general than the premises (we infer the general rules governing the phenomenon from specific examples)

The main goal of inductive reasoning
• The main goal: to achieve good generalization – to infer a rule general enough that it fits any future data
• This is also the main goal of machine learning – to use experience to build good enough performance (in every possible future situation)

McCulloch-Pitts model
Warren McCulloch, Walter Pitts: "A Logical Calculus of the Ideas Immanent in Nervous Activity", 1943

McCulloch-Pitts model
Logical calculus approach:
• elementary logical operations: AND, OR, NOT
• the basic reasoning operator, implication $p \Rightarrow q$ (given premises p, we draw conclusion q)

McCulloch-Pitts model
• Logical operators are functions
Truth tables:

x  y | x AND y | x OR y | x ⇒ y | (NOT x) OR y
0  0 |    0    |   0    |   1   |      1
0  1 |    0    |   1    |   1   |      1
1  0 |    0    |   1    |   0   |      0
1  1 |    1    |   1    |   1   |      1

x | NOT x
0 |   1
1 |   0

McCulloch-Pitts model
• The working question: can a neuron perform the logical functions AND, OR, NOT?
• If the answer is yes, a chain of implications (reasoning) could be implemented in a neural network

McCulloch-Pitts model
[Figure: neuron diagram – inputs $x_1 \dots x_n$, weights $w_1 \dots w_n$, summation, total excitation, activation function, threshold, neuron output (activation)]
McCulloch-Pitts transfer function:
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge \theta$ ?
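The working question above can be checked directly. The following is a minimal sketch, not the lecture's own code: it assumes the neuron fires when the weighted sum reaches the threshold (the "≥" convention), and the weights and thresholds are hand-picked illustrative values.

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts threshold unit: output 1 when the weighted sum
    of the inputs reaches the threshold, otherwise 0."""
    z = sum(w * x for w, x in zip(weights, inputs))  # total excitation
    return 1 if z >= threshold else 0


# Hand-picked (hypothetical) weights and thresholds for each logical function.
def AND(x, y):
    return mp_neuron([x, y], [1, 1], 2)


def OR(x, y):
    return mp_neuron([x, y], [1, 1], 1)


def NOT(x):
    return mp_neuron([x], [-1], 0)


def IMPLIES(x, y):
    # x => y is equivalent to (NOT x) OR y, so it takes a two-neuron network.
    return OR(NOT(x), y)
```

Each single-neuron gate reproduces one column of the truth tables, and the implication is obtained by chaining two neurons, which is the sense in which reasoning could be implemented in a network.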
Implementation of AND, OR, NOT
• McCulloch-Pitts neuron

Including threshold into weights

McCulloch-Pitts model
• Neuron equations:
$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
$a = f(z)$
$z = \sum_{i=1}^{n} w_i x_i$
$z = \mathbf{w} \mathbf{x}^T$ (vector dot product)

Vector dot product
$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\alpha$, where $\alpha$ is the angle between $\mathbf{x}$ and $\mathbf{w}$
• x and w pointing the same way – max similarity
• x and w pointing opposite ways – max "antisimilarity"
• x and w at a right angle – max dissimilarity (orthogonality)

Vector dot product interpretation
• The inputs are called the "input vector"
• The weights are called the "weight vector"
• The neuron is excited when the input vector is similar enough to the weight vector
• The weight vector is a "template" for some set of input vectors

Neurons – elements of the ANNs
Don't be fooled... These are our neurons ...

Neurons – elements of the ANNs
Single neuron (stereoscopic)

Neurons – elements of the ANNs
There is some analogy...

The real neuron
Synaptic connection – organic structure

The real neuron
Synaptic connection – the molecular level

McCulloch-Pitts model
• The conclusion: if we tune the weights of the neuron properly, we can make it implement the transfer function we need (AND, OR, NOT)
• The question: how are the weights of neurons tuned in our brains, what is the adaptation mechanism?

Neuron adaptation
• Donald Hebb (1949, neurophysiologist): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

Hebb rule
$\Delta w_{ij} = \eta \, x_j \, y_i$

Hebb rule
• It is a local rule of adaptation
• The multiplication of input and output signifies a correlation between them
• The rule is unstable – a weight can grow without limit (that doesn't happen in nature, where resources are limited)
• Numerous modifications of the Hebb rule have been proposed to make it stable

Hebb rule
• The Hebb rule is very important and useful ...
• ...
but for now we want to make the neuron learn the function we need

Rosenblatt Perceptron
• Frank Rosenblatt (1958) – Perceptron – a hardware (electromechanical) implementation of an ANN (effectively – 1 neuron).

Rosenblatt Perceptron
• One of the goals of the experiment was to train the neuron, i.e. to make it go active whenever a specific pattern appears on the "retina"
• The neuron was to be trained with examples
• The experimenter ("teacher") was to expose the neuron to different patterns and in each case tell it whether it should fire or not
• The learning algorithm should do its best to make the neuron do what the teacher requires

Perceptron learning rule
A kind of modification of the Hebbian rule (the weight correction depends on the error between the actual and the desired output)
Hebb rule: $\Delta w_k = \eta \, x_k \, y$
Perceptron rule: $\Delta w_k = \eta \, x_k \, (d - y)$

Supervised scheme
• One training example – the pair <input value, desired output> – is called a training pair
• The set of all the training pairs is called the "training set"

Unsupervised scheme

Example of supervised learning
• Linear Associator

Neural networks
• "A set of processing elements influencing each other"
• The neurons (PEs) are interconnected. The output of each neuron can be connected to the input of every neuron, including itself

Neural networks
• If there is a path of propagation (direct or indirect) between the output of a neuron and its own input, we have feedback – such structures are called "recurrent"
• If there is no feedback in a network, such a structure is called "feedforward"

What does recurrent mean?
• A recurrent definition of a concept is a definition that uses the very same concept (but perhaps in a lower-complexity setup)
• A recurrent function is a function calling itself
• The classical recurrent definition – the factorial function: $n! = n \, (n-1)!$, $0! = 1$

Recurrent connection
• "A function calling itself"
$y = f(z)$
$y(t) = f(z(t)) = f(h(y(t-1), \dots)) = f(h(f(z(t-1)), \dots))$

Recurrent connection
• At any given moment, the whole history of past excitations influences the neuron output
• The concept of temporal memory emerges
• The past influences the present to the degree determined by the weight of the recurrent connection
• This weight is effectively a "forgetting factor"

Feedforward layered network

Our brain
• There are ca. $10^{11}$ neurons in our brain
• Each of them is connected on average to 1000 other neurons
• There is only one connection per 10 billion other neurons
• If every neuron were connected to every other, our brain would have to be a few hundred meters in diameter
• There is strong modularity

Our brain
A fragment of the neural network connecting the retina to the visual perception area of the brain

Our brain vs computers
• The memory size estimate – ca. $10^{14}$ connections – gives an estimated size of 100 TB (each connection has a continuous real weight)
• Neurons are quite slow, capable of activating no more than 200 times per second, but there are a lot of them, which gives an estimate of $10^{16}$ floating point operations per second.

Neural networks vs computer
Neural networks:
• Many ($10^{11}$) simple processing elements (neurons)
• Massively parallel, distributed processing
• Memory evenly distributed over the whole structure, content addressable
• Large fault tolerance
Computer:
• A few complex processing elements
• Sequential, centralized processing
• Compact memory, addressed by an index
• Large fault vulnerability

How to train the whole network?
• For the Perceptron – the output of the neuron could be compared to the desired value
• But what about the layered structure?
How to reach the hidden neurons?
• The original idea comes from the experiments of Widrow and Hoff in the 60s
• Global error optimization using gradient descent was used

Supervised scheme once again

Error minimization
• The error function component can be quite elaborately defined
• But the goal is always to minimize the error
• One widely used technique of function optimization (minimization/maximization) is called gradient descent

Error function
• One cycle of training consists of the presentation of many training pairs – it is called one epoch of learning
• The error accumulated for the whole epoch is an average:
$E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2, \quad y_i = F(x_i; \mathbf{w})$

Why quadratic function?

Error function once again
$E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2, \quad y_i = F(x_i; \mathbf{w})$
• As subsequent input/output pairs are "averaged out", we can think of the error function mainly as a function of the weights w
• The goal of learning – to choose the weights in such a way that the error is minimized

Error derivative
[Figure: error function plotted against a single weight $w_i$]
• The derivative gives us information on whether the function increases or decreases when the argument increases (and how fast)
• If the function is falling, the sign of the derivative is negative
• We want to minimize the function value, thus we have to increase the argument
• We have to act "against" the sign of the derivative

The gradient rule
$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

Error function gradient
• In the multidimensional case we deal with a vector of partial derivatives of the error function with respect to each dimension (the gradient):
$\mathbf{g} = \left[ \frac{\partial E}{\partial x_1}, \frac{\partial E}{\partial x_2}, \dots, \frac{\partial E}{\partial x_n} \right]$

Gradient method
[Figure: error surface E over the weights $w_1$, $w_2$]
The method of moving "against" the gradient is commonly called "hill climbing"

Gradient method
Steepest descent demo
• MATLAB demonstration

Other form of activation function
• The so-called sigmoidal function, e.g.:
$f(z) = \frac{1}{1 + e^{-\beta z}}$

Other form of activation function
[Figure: sigmoid plotted for β=1, β=100, β=0.4]

Backpropagation algorithm
$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$
$\Delta w_{ij}$ = ?
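Before the gradient is computed for a whole network, the gradient rule $\Delta w = -\eta \, \partial E / \partial w$ can be tried on a one-weight toy problem. This is only a sketch under stated assumptions: the quadratic error `error`, its minimum at w = 3, the learning rate and the step count are all hypothetical choices made for illustration.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeatedly apply the gradient rule: w <- w - eta * dE/dw."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)  # move "against" the sign of the derivative
    return w


# Hypothetical error function with a single weight; its minimum is at w = 3.
def error(w):
    return (w - 3.0) ** 2


def error_grad(w):
    # Analytic derivative of the toy error function above.
    return 2.0 * (w - 3.0)


w_final = gradient_descent(error_grad, w0=0.0)
```

With these values each step shrinks the distance to the minimum by a constant factor, so `w_final` ends up very close to 3. A much larger `eta` would make the iteration overshoot and diverge, which is the instability a too-large learning rate causes.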
Backpropagation algorithm E E  II II  y  z i i E E III II  w ij wki E E III z y Chain rule • Applies chain rule of differentiation: y  g  w f f y  w y w That makes possible to „transfer” the error backward toward hidden units Chain rule Backward propagation through neuron: E z a  f (z ) E a Backpropagation through neuron 1 a  f ( z)  1  ez E E a  z a z Backpropagation through neuron a 1 '  f z    z 1  e z   2  e  z   1    1 e z 1 1  ez 1    z z z z 1 e 1 e 1 e 1 e  1  1  ez       1 1  z 1  e          f z 1  f z   a1  a   Backpropagation through neuron E E a E   a1  a  z a z a Backpropagation through neuron E w2 E z Backpropagation through neuron m z  f ( wi )   wk xk  w1 x1  w2 x2   wi xi   wm xm k 1 E E z  wi z wi Backpropagation through neuron z  w1 x1  w2 x2   wi xi   wm xm    xi wi wi Backpropagation through neuron E E z E   xi wi z wi z Backpropagation through neuron E E E a E  xi  xi  aa  1xi wi z a z a Backpropagation through neuron • Conclusion: if we know the error function gradient with respect to the output of the neuron, we can compute the gradient with respect to each of it’s weights • In general, our goal is to propagate the error function gradient from the output of the network to the outputs of the hidden units Backpropagation • Additional problem: in general, each hidden neuron is connected to more than one neuron of the next layer • There are many paths for the error gradient to be transmitted backward from the next layer Error backpropagation E is a „sensitivity” of E wij to the change of w E ij z1II E z 2II Backpropagation through layer • Applying the rule of derivation for function of compound arguments: f r x , sx , f r x  f sx     x r x  x sx  x we can propagate the error gradient through the layer Backpropagation through 
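layer
[Figure: activation $a_j$ feeding the excitations $z_1(a_j)$ and $z_2(a_j)$]
$\frac{\partial E(z_1(a_j), z_2(a_j))}{\partial a_j} = \frac{\partial E}{\partial z_1} \frac{\partial z_1}{\partial a_j} + \frac{\partial E}{\partial z_2} \frac{\partial z_2}{\partial a_j}$

Backpropagation through layer
[Figure: activations $a_1$, $a_2$, $a_3$ connected to $z_1$ through the weights $w_{11}$, $w_{12}$, $w_{13}$]
$z_1 = w_{11} a_1 + w_{12} a_2 + w_{13} a_3$

Backpropagation through layer
$z_1 = w_{11} a_1 + w_{12} a_2 + w_{13} a_3$
$\frac{\partial z_1}{\partial a_2} = w_{12}$
More generally:
$z_i = w_{i1} a_1 + w_{i2} a_2 + \dots + w_{ij} a_j + \dots + w_{im} a_m$
$\frac{\partial z_i}{\partial a_j} = w_{ij}$

Backpropagation through layer
$\frac{\partial E(z_1, z_2, \dots, z_i, \dots, z_n)}{\partial a_j} = \frac{\partial E}{\partial z_1} w_{1j} + \frac{\partial E}{\partial z_2} w_{2j} + \dots + \frac{\partial E}{\partial z_i} w_{ij} + \dots + \frac{\partial E}{\partial z_n} w_{nj} = \sum_{k=1}^{n} \frac{\partial E}{\partial z_k} w_{kj}$

Forward propagation
$z_i = \sum_{k=1}^{n} a_k w_{ik}$
The activations of the neurons are propagated

Forward propagation
$a_1^{II} = f(z_1) = \frac{1}{1 + e^{-z_1}}$
The activations of the neurons are propagated

Backpropagation
$\frac{\partial E}{\partial a_j} = \sum_{k=1}^{n} \frac{\partial E}{\partial z_k} w_{kj}$
The error function gradient is propagated

Backpropagation
$\frac{\partial E}{\partial z_i} = \frac{\partial E}{\partial a_i^{II}} \frac{\partial a_i^{II}}{\partial z_i} = \frac{\partial E}{\partial a_i^{II}} \, a_i^{II} (1 - a_i^{II})$
The error function gradient is propagated

Single algorithm cycle
$E = \frac{1}{N} \sum_{p=1}^{N} (a_p^{III} - d_p)^2$
$\frac{\partial E}{\partial a_p^{III}} = \frac{2}{N} (a_p^{III} - d_p)$
$\frac{\partial E}{\partial z^{III}} = \frac{\partial E}{\partial a^{III}} \, a^{III} (1 - a^{III})$
$\frac{\partial E}{\partial w_{11}^{III}} = \frac{\partial E}{\partial z^{III}} \, a_1^{II}$
$\frac{\partial E}{\partial w_{12}^{III}} = \frac{\partial E}{\partial z^{III}} \, a_2^{II}$
$\frac{\partial E}{\partial a_2^{II}} = \frac{\partial E}{\partial z^{III}} \, w_{12}^{III}$
One complete cycle of the algorithm is finished (a situation equivalent to the initial one)

Forward propagation
• One cycle of the algorithm:
– get the inputs of the current layer
– compute the excitations of the considered layer by "transferring" the inputs through the layer of weights (multiplying the inputs by the corresponding weights and performing the summation)
– calculate the activations of the layer's neurons by transferring the neuron excitations through the activation functions
• Repeat that cycle, starting with layer 1 and moving on to the output layer.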
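The forward propagation cycle described above can be sketched in a few lines. This is an illustrative sketch rather than the lecture's MATLAB code: it assumes sigmoidal neurons, assumes the thresholds are folded into the weights (so no separate bias term appears), and uses made-up weight values in the hypothetical `layers` list.

```python
import math


def sigmoid(z):
    """Sigmoidal activation function f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, x):
    """Propagate the input vector x through a list of weight matrices.

    Each layer is a list of rows; row i holds the weights w_i1..w_im
    of neuron i of that layer.
    """
    a = list(x)  # inputs of the current layer
    for weights in layers:
        # excitations: z_i = sum_k a_k * w_ik
        z = [sum(w * ak for w, ak in zip(row, a)) for row in weights]
        # activations: a_i = f(z_i)
        a = [sigmoid(zi) for zi in z]
    return a  # activations of the last layer


# Hypothetical 2-2-1 network with arbitrary illustrative weights.
layers = [
    [[0.5, -0.5], [1.0, 1.0]],  # hidden layer: 2 neurons, 2 inputs each
    [[1.0, -1.0]],              # output layer: 1 neuron, 2 inputs
]
y = forward(layers, [1.0, 0.0])
```

Storing one weight row per neuron keeps the code close to the slide notation $z_i = \sum_k a_k w_{ik}$; a matrix library would do the same with a single matrix-vector product per layer.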
The activations of the neurons of the output layer are the outputs of the network

Backpropagation
• One cycle of the algorithm:
– get the error function gradients with respect to the outputs of the layer
– compute the error gradients with respect to the excitations of the layer's neurons by transferring the gradients backward through the derivatives of the neuron activation functions
– compute the error function gradients with respect to the outputs of the prior layer by transferring the gradients computed so far through the layer of weights (multiplying the gradients by the corresponding weights and performing the summation)

Backpropagation
• Repeat that cycle starting from the last layer – where the error function gradients can be computed directly – on toward the first layer. The gradients computed in the process can be used to calculate the gradients with respect to the weights

BP Algorithm
• It all ends up as a computationally efficient and elegant procedure to compute the partial derivative of the error function with respect to every weight in a network.
• It allows us to correct every weight of a network in such a way as to reduce the error
• Repeating the process over and over gradually reduces the error and constitutes the learning process

Example source code (MATLAB)

Learning rate
• The term η is called the "learning rate"
$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
The faster, the better, but too fast can cause the learning process to become unstable

Learning rate
• In practice, we have to manipulate the learning rate during the course of the learning process
• The strategy of a constant learning rate is not too good

Two types of problems
• Data grouping/classification
• Function approximation

Classification
[Figure: inputs ... → classification system → "7!"]

Classification
Alternative scheme:
[Figure: inputs ... → classification system → one output per class, listed below]
0 (1%), 1 (1%), 2 (1%), 3 (1%), 4 (1%), 5 (1%), 6 (1%), 7 (90%), 8 (1%), 9 (1%), no decision

Classification – typical applications
Classification == pattern recognition:
• medical diagnosis
• fault condition recognition
• handwriting recognition
• object identification
• decision support

Classification example
Applet: character recognition

Classification
• Assumes that a class is a group of similar objects
• Similarity has to be defined
• Similar objects – objects having similar attributes
• We have to describe the attributes

Classification
• E.g. some human attributes:
– Height
– Age
Class K: "Tall people under 30"

Classification
Object O1 belonging to the class K: "A person 180 cm tall, 23 years old" (180, 23)
Object O2 that doesn't belong to the class K: "A person 165 cm tall, 35 years old" (165, 35)

Classification
[Figure: objects O1 and O2 plotted in the HEIGHT–AGE plane; HEIGHT axis ticks 165 and 180, AGE axis ticks 23 and 35]

The similarity of objects
[Figure: the same HEIGHT–AGE plot, with the distance between the two objects marked]

The similarity
• Euclidean distance (Euclidean metric):
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

Other metrics
• Manhattan metric

Classification
• The more attributes, the more dimensions: X1, X2, X3

Multidimensional metric
$d = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + (x_{13} - x_{23})^2}$

Multidimensional data
• OLIVE presentation

Classification
Attr 1, Attr 2, Attr 3, Attr 4, Attr 5, Attr 6, Attr 7, Attr 8, etc.

Classification
Y = K*X
• Drawing a boundary between two groups:
– AGE > K*HEIGHT on one side of the line
– AGE = K*HEIGHT on the line itself
– AGE < K*HEIGHT on the other side
[Figure: HEIGHT–AGE plot with the line through the origin]

Classification
AGE = K*HEIGHT + B
AGE − K*HEIGHT − B = 0
AGE + K2*HEIGHT + B2 = 0
[Figure: HEIGHT–AGE plot with the shifted line]

Classification
• In general, for the multidimensional case, the so-called classification hyperplane is described by:
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = 0$
• We are very close to the McCulloch-Pitts ...

McCulloch-Pitts
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge \theta$ ?
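The link between the hyperplane equation and the threshold neuron can be made concrete with the two people from the earlier slides. This is a sketch only: the weights `w` and the offset `b` are hand-picked hypothetical values, chosen so that the line separates O1 from O2, and are not taken from the lecture.

```python
def classify(x, w, b):
    """Threshold unit as a linear classifier: return 1 when the point x
    lies on the non-negative side of the hyperplane w.x + b = 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0


# Attributes are ordered (height in cm, age in years).
# Hypothetical hand-picked parameters: the line HEIGHT - AGE - 140 = 0.
w = (1.0, -1.0)
b = -140.0

o1 = (180, 23)  # "A person 180 cm tall, 23 years old" - belongs to class K
o2 = (165, 35)  # "A person 165 cm tall, 35 years old" - does not belong

in_class_o1 = classify(o1, w, b)
in_class_o2 = classify(o2, w, b)
```

Moving the threshold to the left-hand side turns the McCulloch-Pitts condition into the hyperplane form, with $b = -\theta$; training the neuron then amounts to finding `w` and `b` instead of picking them by hand.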
Neuron as a simple classifier
• A single McCulloch-Pitts threshold unit performs a linear dichotomy (separation of two classes in a multidimensional space)
• Tuning the weights and the threshold changes the orientation of the separating hyperplane

Neuron as a simple classifier
• If we tune the weights properly (train the neuron properly), it will classify the processed objects
• Processing an object means exposing the object's attributes on the neuron inputs

More classes
• More neurons – a network
• Every neuron performs a bisection of the feature space
• A few neurons partition the space into a few distinct areas

Sigmoidal activation function
[Figure: HEIGHT–AGE plot]

Classification example
• NeuroSolutions – Principal Component

Complicated separation border
• NeuroSolutions – Support Vector Machine

Approximation
X → [BLACK BOX ?] → Y
$Y = F(X)$

Example
• True phenomenon

Example
• There is only a limited number of observations:

Example
• And the observations are corrupted:

Typical situation
• We have a small amount of data
• The data is corrupted (we are not certain how reliable it is)

Example
• The experimenter sees only the data:

Experimenter/system task
• "To fill the gaps"?
• We would call that "interpolation"
• But what we truly have in mind is approximation: looking for a model ("trace") that is most similar (approximate) to the unknown (!) true phenomenon

Example
• We can apply e.g. MATLAB polyfit:

Polyfit
• Polynomial approximation
$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$

Example
• Polyfit with a 2nd order polynomial:

Example
• But how do we know that we should apply a 2nd order polynomial?

Example
• And what if we apply the 15th degree?
It fits the data much better (but it doesn't fit the original well):

The variance factor
• The higher the degree, the more "freaky" it gets
• The 15th degree is quite flexible – it can be fitted to many things
• However, generalization is sacrificed – the model fits the data well, but would most probably fail on other data that come later
• That comes too close to modelling the variance of the data

Example
• We could also insist on the 1st order

Example
• ... or even the 0th order (the data are almost completely ignored)...

The bias factor
• A lower polynomial degree means lower flexibility
• An arbitrary choice of the model degree is what we called an inductive bias
• It is a kind of a priori knowledge that we introduce
• In the case of the 0th and 1st order the bias is too strong

Polyfit
A polynomial:
$y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$
Training set:
$(x_1, d_1), (x_2, d_2), \dots, (x_p, d_p)$
Polyfit:
$d_1 = a_0 + a_1 x_1 + a_2 x_1^2 + \dots + a_n x_1^n$
$d_2 = a_0 + a_1 x_2 + a_2 x_2^2 + \dots + a_n x_2^n$
$\vdots$
$d_p = a_0 + a_1 x_p + a_2 x_p^2 + \dots + a_n x_p^n$

Approximation
• Linear model:
• A model employing polynomials (linear as well):

Approximation
• Generalized linear model:

Approximation
• The $h_k()$ functions can be various: polynomial, sine, ...
• They can be sigmoid as well

Approximation
• An ANN can do a linear model...
[Figure: units $h_1$, $h_2$, ... with weights $w_{11}$, $w_{21}$, ... summed into the output $w_{11} h_1 + w_{21} h_2 + \dots$]

Approximation
• But it can do much more!

ANN transfer function
• This looks like a nonlinear function, indeed ...
$f(\mathbf{W}, x) = \dfrac{1}{1 + e^{-w^{III} \frac{1}{1 + e^{-w^{II} \frac{1}{1 + e^{-w^{I} x}}}}}}$

Approximation
• An Artificial Neural Network built on processing elements with sigmoidal activation functions is a universal approximator for functions of class C1 (continuous up to the first derivative) – Hornik, 1989
• Every typical transfer function can be modelled with arbitrary precision, provided there is an appropriate number of neurons

Function approximation example
• Java applet – function approximation

Where to go now?
• This set of slides:
– http://pr.radom.net/~pgolabek/Antwerp/NNIntro.ppt
• Be sure to check the comp.ai.neural-nets FAQ:
– http://www.faqs.org/faqs/ai-faq/neural-nets/
• Books:
– Simon Haykin: "Neural Networks – A Comprehensive Foundation"
– Christopher Bishop: "Neural Networks for Pattern Recognition"
– "Neural and Adaptive Systems" – the NeuroSolutions interactive book (www.nd.com)

Where to go now
• Software:
– NeuroSolutions – www.nd.com
– MATLAB Neural Network Toolbox
– SNNS – Stuttgart Neural Network Simulator
– and countless others
