Download 5-NeuralNetworks

Machine Learning Neural Networks Ch. 4 Biological motivation • Can we simulate the human learning process?  Two schools • modeling biological systems • ‘just’ building learning systems • Biological learning system (brain) – complex network of neurons – properties • Neuron switching time ~0.001 second • Number of neurons ~ 1010 • Connections per neuron ~104 - 105 • Scene recognition time ~0.1 second • 100 inference steps doesn't seem like enough • much parallel computation Artificial Neural Networks (ANN) • ANN – network of simple units – real-valued inputs & outputs • • • • Many neuron-like threshold switching units Many weighted interconnections among units Highly parallel, distributed process Emphasis on tuning weights automatically Neural Speed Constraints • Neurons have a “switching time” on the order of a few milliseconds, compared to nanoseconds for current computing hardware. • However, neural systems can perform complex cognitive tasks (vision, speech understanding) in tenths of a second. • Only time for performing 100 serial steps in this time frame, compared to orders of magnitude more for current computers. • Must be exploiting “massive parallelism.” • Human brain has about 1011 neurons with an average of 104 connections each. R. Mooney, UT Austin Neural Network Learning • Learning approach based on modeling adaptation in biological neural systems. • Perceptron: Initial algorithm for learning simple neural networks (single layer) developed in the 1950’s. • Backpropagation: More complex algorithm for learning multi-layer neural networks developed in the 1980’s. R. Mooney, UT Austin Real Neurons How does our Brain Work? • A neuron is connected to other neurons via its input and output links. • Each incoming neuron has an activation value and each connection has a weight associated with it. • The neuron sums the incoming weighted values and this value is input to an activation function. • The output of the activation function is the output from the neuron. Neural Communication • • • • • Electrical potential across cell membrane exhibits spikes called action potentials. Spike originates in cell body, travels down axon, and causes synaptic terminals to release neurotransmitters. Chemical diffuses across synapse to dendrites of other neurons. Neurotransmitters can be excititory or inhibitory. If net input of neurotransmitters to a neuron from other neurons is excititory and exceeds some threshold, it fires an action potential. R. Mooney, UT Austin Real Neural Learning • To model the brain we need to model a neuron. • Each neuron performs a simple computation. – It receives signals from its input links and it uses these values to compute the activation level (or output) for the neuron. – This value is passed to other neurons via its output links. Perceptrons • Introduced in the late 50s . Minsky and Papert. • Perceptron convergence theorem Rosenblatt 1962: • Perceptron will learn to classify any linearly separable set of inputs. XOR function Perceptron is a network: – single-layer – feed-forward the data only travels in one direction (from the input neurons to the output neurons) ALVINN drives 70 mph on highways Perceptron The input value received of a neuron is calculated by summing n the weighted input values from its input links  wi xi i 0 threshold function Vector notation: Different Threshold Functions     1, w  x  0 o( x )      1, w  x  0     1, w  x  t o( x )   0, otherwise     1, w  x  0 o( x )      1, w  x  0     1, w  x  0 o( x )      1 , w x 0  We should learn the weight w1,…, wn Examples (step activation function) In1 In2 Out In1 In2 Out In Out 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 Perceptron Training • Assume supervised training examples giving the desired output for a unit given a set of known input activations. • Learn synaptic weights so that unit produces the correct output for each example. • Perceptron uses iterative update algorithm to learn a correct set of weights. R. Mooney, UT Austin Perceptron Training Rule • Learn weights of a single unit • Problem – given examples labeled +1/-1 – learn weight vector • Algorithms – perceptron rule, delta rule, … – guaranteed to converge – different acceptable hypotheses & assumptions Perceptron Training Rule • Update weights by: wi  wi  wi wi   (t  o ) wi where η is the learning rate t – target output for the current training example • Equivalent to rules: – If output is correct do nothing. – If output is high, lower weights on active inputs – If output is low, increase weights on active inputs Perceptron training rule • Can prove it will converge – if training data is linearly separable – and η sufficiently small Perceptron Learning Algorithm • Iteratively update weights until convergence. Initialize weights to random values Until outputs of all training examples are correct For each training pair, E, do: Compute current output oj for E given its inputs Compare current output to target value, tj , for E Update synaptic weights and threshold using learning rule • Each execution of the outer loop is typically called an epoch. R. Mooney, UT Austin Delta Rule • Works reasonably with non-separable data • Minimizes error • Gradient descent method – basis of Backpropagation method – basis for methods working in multidimensional continuous spaces Gradient Descent Preliminaries • Tangent • Derivatives Tangent, Inflection Point • In geometry, the tangent line (or simply the tangent) to a curve at a given point is the straight line that "just touches" the curve at that point • In differential calculus, an inflection point, or point of inflection (or inflexion) is a point on a curve at which the curvature changes sign. The curve changes from being concave upwards (positive curvature) to concave downwards (negative curvature), or vice versa. Visualization (tangent) Visualization (inflection point) Extremum: Maximum and Minimum global maximum local maximum local minimum global minimum • In mathematics, Fermat's theorem is a theorem in real analysis, named after Pierre de Fermat. • It gives a method to find local maxima and minima of differentiable functions on open sets by showing that every local extremum of the function is a stationary point (the function derivative is zero in that point). Hypothesis space • Example case: two weights – error surface E(w) – parabolic (by definition) – single global minimum – arrow: negated gradient at one point – steepest descent along the surface Gradient Descent • To understand, consider a linear unit, where o  w0  w1 x1  ...  wn xn • Let’s learn the wi’s that minimize the squared error  1 E[ w]  2  (td  od ) 2 d D • • • Where D is the set of training examples In statistics, mean squared error is one of many ways to quantify the amount by which an estimator differs from the true value of the quantity being estimated We should minimize squared error There exist alternative error functions Hypothesis space • Example case: two weights – error surface E(w) – parabolic (by definition) – single global minimum – arrow: negated gradient at one point – steepest descent along the surface Gradient Descent  E(w)  [ wE0 , wE1 ,..., wEn ] Partial derivatives of E with respect to weights wi is itself a vector, whose components are the partial derivatives of E with respect to each of the wi. When interpreted as a vector in weightspace, the gradient specifies the direction that produces the steepest increase in E. The negative of this vector therefore gives the direction of steepest decrease where , Hypothesis space • Example case: two weights – error surface E(w) – parabolic (by definition) – single global minimum – arrow: negated gradient at one point – steepest descent along the surface Derivation of the rule • Compute the gradient of E(w) – – – – – – vector E(w) of partial derivatives specifies the direction of steepest increase in E training rule: w = -E(w) componentwise wi = wi +  wi,  wi = -E/ wi wi is changed in proportion to E/ wi Practical algorithm • Efficient computation of E/ wi – Reduces to sum ((t - o)(- xi)) • Converges to minimum error – too large   may oscillate – common modification: gradually decrease learning rate Problems with gradient descent • Gradient descent – continuous hypothesis space – error can be differentiated wrt hypothesis parameters • Difficulties – converging can be slooooow – many local minima  no guarantee we find the global one Remarks • Perceptron rule – thresholded output – converges after a finite # of iterations – provided data is separable • Delta rule – unthresholded – asymptotic convergence – regardless of training data Perceptron as a Linear Separator • Since perceptron uses linear threshold function, it is searching for a linear separator that discriminates the classes. o3 w12o2  w13o3  T1 ?? w12 T1 o3   o2  w13 w13 o2 R. Mooney, UT Austin Or hyperplane in n-dimensional space Concept Perceptron Cannot Learn • Cannot learn exclusive-or, or parity function in general. o3 1 + ?? – – + 0 1 o2 R. Mooney, UT Austin Perceptron Limits • System obviously cannot learn concepts it cannot represent. • Minksy and Papert (1969) wrote a book analyzing the perceptron and demonstrating many functions it could not learn. • These results discouraged further research on neural nets; and symbolic AI became the dominate paradigm. R. Mooney, UT Austin Perceptron as Hill Climbing • • • • The hypothesis space being search is a set of weights and a threshold. Objective is to minimize classification error on the training set. Perceptron effectively does hill-climbing (gradient descent) in this space, changing the weights a small amount at each point to decrease training set error. For a single model neuron, the space is well behaved with a single minima. training error 0 weights R. Mooney, UT Austin Perceptron Performance • Linear threshold functions are restrictive (high bias) but still reasonably expressive; more general than: – Pure conjunctive – Pure disjunctive – M-of-N (at least M of a specified set of N features must be present) • In practice, converges fairly quickly for linearly separable data. • Can effectively use even incompletely converged results when only a few outliers are misclassified. • Experimentally, Perceptron does quite well on many benchmark data sets. R. Mooney, UT Austin When to Consider Neural Networks • Input is high-dimensional discrete or real-valued (e.g. raw sensor input) • Output is discrete or real valued • Output is a vector of values • Possibly noisy data • Form of target function is unknown • Human readability of result is unimportant • Examples: – Speech phoneme recognition [Waibel] – Image classification [Kanade, Baluja, Rowley] – Financial prediction

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download 5-NeuralNetworks