CSC2535: Computation in Neural Networks
Lecture 1b: Learning without hidden units
Geoffrey Hinton
www.cs.toronto.edu/~hinton/csc2535/notes/lec1b.htm

Linear neurons
• The neuron has a real-valued output which is a weighted sum of its inputs:
  ŷ = Σ_i w_i x_i = wᵀx
  where ŷ is the neuron's estimate of the desired output, w is the weight vector and x is the input vector.
• The aim of learning is to minimize the discrepancy between the desired output and the actual output.
  – How do we measure the discrepancies?
  – Do we update the weights after every training case?
  – Why don't we solve it analytically?

The delta rule
• Define the error as the squared residuals summed over all training cases:
  E = ½ Σ_n (y_n − ŷ_n)²
• Now differentiate to get error derivatives for the weights:
  ∂E/∂w_i = ½ Σ_n (∂ŷ_n/∂w_i)(dE_n/dŷ_n) = −Σ_n x_{i,n} (y_n − ŷ_n)
• The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases:
  Δw_i = −ε ∂E/∂w_i

The error surface
• The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
  – It is a quadratic bowl.
  – Vertical cross-sections are parabolas.
  – Horizontal cross-sections are ellipses.
(Figure: a quadratic bowl for E over the (w1, w2) plane.)

Online versus batch learning
• Batch learning does steepest descent on the error surface.
• Online learning zig-zags around the direction of steepest descent.
(Figure: weight space with the constraint lines from training case 1 and training case 2; online updates move perpendicular to one constraint at a time.)

Convergence speed
• The direction of steepest descent does not point at the minimum unless the ellipse is a circle.
  – The gradient is big in the direction in which we only want to travel a small distance.
  – The gradient is small in the direction in which we want to travel a large distance.
  Δw_i = −ε ∂E/∂w_i
• This equation is sick: the right-hand side needs to be multiplied by a term with the dimensions of w².
• A later lecture will cover ways of fixing this problem.

Preprocessing the input vectors
• Instead of trying to predict the answer directly from the raw inputs, we could start by extracting a layer of "features".
  – This is sensible if we already know that certain combinations of input values would be useful.
  – The features are equivalent to a layer of hand-coded non-linear neurons.
• As far as the learning algorithm is concerned, the hand-coded features are the input.

Is preprocessing cheating?
• It seems like cheating if the aim is to show how powerful learning is: the really hard bit is done by the preprocessing.
• It's not cheating if we learn the non-linear preprocessing.
  – This makes learning much more difficult and much more interesting.
• It's not cheating if we use a very big set of non-linear features that is task-independent.
  – Support Vector Machines make it possible to use a huge number of features without much computation or data.

Perceptrons
• The input is recoded using hand-picked features that do not adapt.
• Only the last layer of weights is learned.
• The output units are binary threshold neurons and are learned independently.
(Figure: output units on top of a layer of non-adaptive hand-coded features on top of input units.)

Binary threshold neurons
• McCulloch-Pitts (1943):
  – First compute a weighted sum of the inputs from other neurons:
    z = Σ_i x_i w_i
  – Then output a 1 if the weighted sum exceeds the threshold:
    y = 1 if z ≥ θ, and y = 0 otherwise.
(Figure: the step function; y jumps from 0 to 1 at the threshold value of z.)
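The following is a minimal sketch of the linear neuron, the batch delta rule, and the binary threshold unit described above, assuming NumPy; the function names, learning rate, and toy data are illustrative and not part of the original notes.

```python
import numpy as np

# A linear neuron computes y_hat = w^T x for each training case, and the batch
# delta rule does steepest descent on E = 1/2 * sum_n (y_n - y_hat_n)^2, i.e.
# delta_w_i = epsilon * sum_n x_{i,n} * (y_n - y_hat_n).

def train_linear_neuron(X, y, epsilon=0.1, n_epochs=200):
    """X: (n_cases, n_inputs) input vectors as rows; y: (n_cases,) desired outputs."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        y_hat = X @ w                    # the neuron's estimates of the desired outputs
        residuals = y - y_hat            # (y_n - y_hat_n) for every training case
        w += epsilon * X.T @ residuals   # batch update: derivatives summed over all cases
    return w

def binary_threshold_output(x, w, theta):
    """McCulloch-Pitts unit: output 1 if the weighted sum reaches the threshold theta."""
    z = np.dot(w, x)
    return 1 if z >= theta else 0

# Toy example: recover the weights of a noiseless linear target, then threshold.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])                                       # true weights (2, -1)
w = train_linear_neuron(X, y)
print(w)                                                             # close to [ 2. -1.]
print(binary_threshold_output(np.array([1.0, 1.0]), w, theta=0.5))   # 1
```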
The perceptron convergence procedure
• Add an extra component with value 1 to each input vector. The "bias" weight on this component is minus the threshold. Now we can forget the threshold.
• Pick training cases using any policy that ensures that every training case will keep getting picked.
  – If the output unit is correct, leave its weights alone.
  – If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
  – If the output unit incorrectly outputs a one, subtract the input vector from the weight vector.
• This is guaranteed to find a suitable set of weights if any such set exists. (A code sketch of this procedure appears at the end of these notes.)

Weight space
• Imagine a space in which each axis corresponds to a weight.
  – A point in this space is a weight vector.
• Each training case defines a plane.
  – On one side of the plane the output is wrong.
• To get all training cases right we need to find a point on the right side of all the planes.
(Figure: weight space with the origin, an input vector, and the regions of good and bad weights on either side of the constraint plane.)

Why the learning procedure works
• Consider the squared distance between any satisfactory weight vector and the current weight vector.
  – Every time the perceptron makes a mistake, the learning algorithm moves the current weight vector towards all satisfactory weight vectors (unless it crosses the constraint plane).
• So consider "generously satisfactory" weight vectors that lie within the feasible region by a margin at least as great as the largest update.
  – Every time the perceptron makes a mistake, the squared distance to all of these weight vectors is decreased by at least the squared length of the smallest update vector.

What perceptrons cannot do
• The binary threshold output units cannot even tell whether two single-bit numbers are the same!
  Same: (1,1) → 1; (0,0) → 1
  Different: (1,0) → 0; (0,1) → 0
• The following set of inequalities is impossible:
  w1 + w2 ≥ θ,  0 ≥ θ,  w1 < θ,  w2 < θ
(Figure: the four points (0,0), (0,1), (1,0), (1,1) in data space; the positive and negative cases cannot be separated by a plane.)

What can perceptrons do?
• They can only solve tasks if the hand-coded features convert the original task into a linearly separable one. How difficult is this?
• The N-bit parity task:
  – Requires N features of the form: are at least m bits on?
  – Each feature must look at all the components of the input.
• The 2-D connectedness task:
  – Requires an exponential number of features!
  – Connectedness is much easier to compute using iterative algorithms that propagate markers.

Distinguishing T from C in any orientation and position
• What kind of features are required to distinguish two different patterns of 5 pixels independent of position and orientation?
  – Do we need to replicate T and C templates across all positions and orientations?
  – Looking at pairs of pixels will not work.
  – Looking at triples will work if we assume that each input image only contains one object.
• Replicate the following two feature detectors in all positions.
(Figure: two small templates of + and − pixel weights.) If any of these detectors reaches its threshold of 2, it's a C. If not, it's a T.

Beyond perceptrons
• Need to learn the features, not just how to weight them to make a decision.
  – We may need to abandon guarantees of finding optimal solutions.
• Need to make use of recurrent connections, especially for modeling sequences.
  – Long-term temporal regularities are hard to learn.
• Need to learn representations without a teacher.
  – This makes it much harder to decide what the goal is.
• Need to learn complex hierarchical representations.
  – Must traverse deep hierarchies using fixed hardware.
• Need to attend to one part of the sensory input at a time.
  – Requires segmentation and sequential organization of sensory processing.
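Below is a minimal sketch of the perceptron convergence procedure described earlier, assuming NumPy; the function name, the fixed sweep order over the training cases, and the toy dataset (logical OR, which is linearly separable) are illustrative choices, not part of the original notes.

```python
import numpy as np

def train_perceptron(X, t, n_sweeps=100):
    """Perceptron convergence procedure on inputs X (n_cases, n_inputs)
    with binary targets t in {0, 1}."""
    # Add an extra component with value 1 to each input vector; its weight
    # plays the role of minus the threshold, so the threshold can be forgotten.
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X.shape[1])
    for _ in range(n_sweeps):                  # keep picking every training case
        errors = 0
        for x, target in zip(X, t):
            y = 1 if np.dot(w, x) >= 0 else 0
            if y == target:
                continue                       # correct: leave the weights alone
            if target == 1:                    # incorrectly output a zero
                w += x                         # add the input vector
            else:                              # incorrectly output a one
                w -= x                         # subtract the input vector
            errors += 1
        if errors == 0:                        # every case is right: we are done
            break
    return w

# Toy example: logical OR is linearly separable, so the procedure converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])
w = train_perceptron(X, t)
print([(1 if np.dot(w, np.append(x, 1)) >= 0 else 0) for x in X])   # [0, 1, 1, 1]
```

On a task that is not linearly separable (such as the two-bit "same or different" problem above), the loop never reaches zero errors and simply stops after n_sweeps, which is the behaviour the convergence theorem leaves open.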