* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download NeuralNets273ASpring09
Apical dendrite wikipedia , lookup
Single-unit recording wikipedia , lookup
Nonsynaptic plasticity wikipedia , lookup
Multielectrode array wikipedia , lookup
Mirror neuron wikipedia , lookup
Neural oscillation wikipedia , lookup
Axon guidance wikipedia , lookup
Stimulus (physiology) wikipedia , lookup
Caridoid escape reaction wikipedia , lookup
Holonomic brain theory wikipedia , lookup
Neural engineering wikipedia , lookup
Neural coding wikipedia , lookup
Neuroanatomy wikipedia , lookup
Feature detection (nervous system) wikipedia , lookup
Premovement neuronal activity wikipedia , lookup
Neuropsychopharmacology wikipedia , lookup
Pre-Bötzinger complex wikipedia , lookup
Artificial neural network wikipedia , lookup
Optogenetics wikipedia , lookup
Central pattern generator wikipedia , lookup
Development of the nervous system wikipedia , lookup
Pattern recognition wikipedia , lookup
Gene expression programming wikipedia , lookup
Time series wikipedia , lookup
Biological neuron model wikipedia , lookup
Metastability in the brain wikipedia , lookup
Neural modeling fields wikipedia , lookup
Channelrhodopsin wikipedia , lookup
Catastrophic interference wikipedia , lookup
Synaptic gating wikipedia , lookup
Nervous system network models wikipedia , lookup
Backpropagation wikipedia , lookup
Convolutional neural network wikipedia , lookup
Lecture 4 Neural Networks ICS 273A UC Irvine Instructor: Max Welling Neurons • Neurons communicate by receiving signals on their dendrites. Adding these signals and firing off a new signal along the axon if the total input exceeds a threshold. • The axon connects to new dendrites through synapses which can learn how much signal is transmitted. • McCulloch and Pitt (’43) built a first abstract model of a neuron. 1 b y g (Wi xi b ) i output activation function input weights bias Neurons • We have about 1011 neurons, each one connected to 10 4 other neurons on average. • Each neuron needs at least 10 3 seconds to transmit the signal. • So we have many, slow neurons. Yet we recognize our grandmother in 10 sec. 1 • Computers have much faster switching times: 10 10 sec. • Conclusion: brains compute in parallel ! • In fact, neurons are unreliable/noisy as well. But since things are encoded redundantly by many of them, their population can do computation reliably and fast. Classification / Regression • Neural nets are a parameterized function Y=f(X;W) from inputs (X) to outputs (Y). • If Y is continuous: regression, if Y is discrete: classification. • We adapt the weights so as to minimize the error between the data and the model predictions. Or, in other words: maximize the conditional probability of the data (output, given attributes) • E.g. 2-class classification (y=0/1) N P (y | x ) f (xn )yn (1 f (xn ))1Yn n 1 N error yn logf (xn ) (1 yn )log(1 f (xn )) n 1 looks familiar ? fi (x ) Wij Logistic Regression xj • If we use the following model for f(x), we obtain logistic regression! yn f (xn ;W ) (Wi xin b ) 1 (z ) 1 exp( z ) i • This is called a “perceptron” in neural networks jargon. • Perceptrons can only separate classes linearly. fi (x ) Wij xj Regression • Probability of output given input attributes is Normally distributed. 1 2 p (y | x ) c exp (yin Wij x jn bi ) j n 1 i 1 2 N • Error is negative log-probability = squared loss function: N dout din n 1 i 1 j 1 error (yin Wij x jn bi )2 Optimization • We use stochastic gradient descent: pick a single data-item, compute the contribution of that data-point to the overall gradient and update the weights. Repeat: 1) Pick randon data-item (yn , xn ) 2a ) d errorn (yin Wik xkn )x jn dWij k 2b ) d errorn (yin Wik xkn ) dbi k 3a ) Wij Wij 3b ) bi bi d errorn dWij d errorn dbi Stochastic Gradient Descent stochastic updates full updates (averaged over all data-items) • Stochastic gradient descent does not converge to the minimum, but “dances” around it. • To get to the minimum, one needs to decrease the step-size as one get closer to the minimum. • Alternatively, one can obtain a few samples and average predictions over them (similar to bagging). Multi-Layer Nets yˆi g (Wij3hj2 bi 3 ) y j W3,b3 h2 W2,b2 hi 2 g (Wij2hj1 bi 2 ) j hi 1 g (Wij1x j bi 1 ) h1 j W1,b1 x • Single layers can only do linear things. If we want to learn non-linear decision surfaces, or non-linear regression curves, we need more than one layer. • In fact, NN with 1 hidden layer can approximate any boolean and cont. functions Back-propagation • How do we learn the weights of a multi-layer network? Answer: Stochastic gradient descent. But now the gradients are harder! example: error yin log in (1 yin )log(1 in ) y in d error d errorn d in 2 2 dWjk dW in in jk d Wij3hjn2 bi 3 j d errorn in (1 in ) 2 in dWjk in in d errorn in in (1 in )W 3 ij dhjn2 dWjk2 d errorn d errorn in in in in h2 d Wjk2hkn1 bj2 d errorn in (1 in )Wij3 jn (1 jn ) k 2 in dWjk in W3,b3 W2,b2 h1 in (1 in )Wij3 jn (1 jn )hkn1 in (1 in )Wij3 jn (1 jn ) ( Wkl1 xln bk1 ) l W1,b1 x i Back Propagation y i yˆi (W h bi ) 3 2 ij j j 3 hi 2 (Wij2hj1 bi 2 ) in3 yˆin (1 yˆin ) h2 d errorin d in jn2 hjn2 (1 hjn2 ) Wij3in3 kn1 hkn1 (1 hkn1 ) 2 Wjk2 jn upstream i j W2,b2 W2,b2 h1 i W3,b3 W3,b3 h2 y hi 1 (Wij1x j bi 1 ) W1,b1 x Upward pass h1 j W1,b1 x downward pass upstream j Back Propagation y i in3 yˆin (1 yˆin ) d errorin d in W3,b3 h2 d error d errorn in (1 in )Wij3 jn (1 jn ) kn 2 dWjk in in 2 1 jn hkn jn2 hjn2 (1 hjn2 ) upstream i Wij3in3 2 1 Wjk2 Wjk2 jn hkn W2,b2 h1 2 bj2 bj2 jn kn1 hkn1 (1 hkn1 ) upstream j W1,b1 x 2 Wjk2 jn ALVINN Learning to drive a car This hidden unit detects a mildly left sloping road and advices to steer left. How would another hidden unit look like? Weight Decay • NN can also overfit (of course). • We can try to avoid this by initializing all weights/biases terms to very small random values and grow them during learning. • One can now check performance on a validation set and stop early. • Or one can change the update rule to discourage large weights: 2 1 Wjk2 Wjk2 jn hkn Wjk2 2 bj2 bj2 jn bj2 • Now we need to set using X-validation. • This is called “weight-decay” in NN jargon. Momentum • In the beginning of learning it is likely that the weights are changed in a consistent manner. • Like a ball rolling down a hill, we should gain speed if we make consistent changes. It’s like an adaptive stepsize. • This idea is easily implemented by changing the gradient as follows: 2 1 Wjk2 (new ) jn hkn Wjk2 (old ) Wjk2 Wjk2 Wjk2 (new ) (and similar to biases) Conclusion • NN are a flexible way to model input/output functions • They can be given a probabilistic interpretation • They are robust against noisy data • Hard to interpret the results (unlike DTs) • Learning is fast on large datasets when using stochastic gradient descent plus momentum. • Overfitting can be avoided using weight decay or early stopping • There are also NN which feed information back (recurrent NN) • Many more interesting NNs: Boltzman machines, self-organizing maps,...