The Multi-Layer Perceptron
non-linear modelling by adaptive pre-processing
Rob Harrison, AC&SE

[Figure: block diagram: input data x feeds a pre-processing layer, then an adaptive layer, giving output y; the error e compares the output y with the target z]

Adaptive Basis Functions
• "linear" models
  – fixed pre-processing
  – parameter → cost mapping "benign": easy to optimize
• but
  – combinatorial
  – arbitrary choices
• what is the best pre-processor to choose?

The Multi-Layer Perceptron
• formulated from loose biological principles
• popularized mid 1980s
  – Rumelhart, Hinton & Williams 1986
  – earlier: Werbos 1974, Ho 1964
• "learn" the pre-processing stage from data
• layered, feed-forward structure
  – sigmoidal pre-processing
  – task-specific output non-linear model

Generic MLP
[Figure: generic feed-forward MLP: input layer L0 (x1 … xd), hidden layers L1 … LM-1, output layer LM (y1 … yn); more layers can be added]

Two-Layer MLP
[Figure: two-layer MLP: inputs x1 … xd (layer L0), sigmoidal hidden units j (layer L1), single output y (layer L2)]
• y = b + Σj vj f(wjᵀx + bj)

A Sigmoidal Unit
• activation: aj = Σi wji xi + bj
• output: yj = f(aj), with f a sigmoid

Combinations of Sigmoids
[Figure: a linear combination of sigmoids shaping a 1-D curve; 2-D example from Pattern Classification, 2e, by Duda, Hart & Stork, © 2001 John Wiley]

Universal Approximation
• linear combination of "enough" sigmoids
  – Cybenko, 1989
• single hidden layer adequate
  – more may be better
• choose hidden parameters (w, b) optimally
• problem solved?

Pros
• compactness
  – potential to obtain the same veracity with a much smaller model
  – c.f. sparsity/complexity control in linear models

Compactness of Model
[Figure: reduction in SSE versus 'size' of pre-processor H, for input dimensions d = 5, 10, 50: the MLP error falls as O(1/H), a fixed series expansion only as O(1/H^(2/d))]

& Cons
• parameter → cost mapping "malign"
  – optimization difficult
  – many solutions possible
  – effect of hidden weights on the output is non-linear

Multi-Modal Cost Surface
[Figure: error surface with a global minimum and a nearby local minimum: which way does the gradient point?]

Heading Downhill
• assume
  – minimization (e.g. SSE)
  – analytically intractable
• step the parameters downhill
  – wnew = wold + step in the right direction (see the sketch below)
• backpropagation (of error)
  – slow but efficient
• conjugate gradients, Levenberg–Marquardt
  – for preference

Steepest Descent Backprop
[Figure: Neural Network DESIGN "Steepest Descent Backprop #1" demo: a small 1-2-1 network; the sum-squared-error surface and its contour over W1(1,1) and b1(1) are shown, and clicking in the contour graph starts the steepest-descent learning algorithm]

Conjugate Gradient Backprop
[Figure: Neural Network DESIGN "Conjugate Gradient Backprop" demo (Chapter 12): trajectories of conjugate gradients and of steepest-descent backprop on the same error contour]
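Below is a minimal NumPy sketch of the two-layer MLP and steepest-descent backpropagation just described: a single hidden layer of sigmoidal units feeding a linear output, with every parameter stepped downhill against the gradient of the sum-squared error. It is not the Neural Network DESIGN demo; the target function, hidden-layer size, learning rate and epoch count are illustrative choices, not values from the lecture.

```python
# Two-layer MLP, y = b + sum_j v_j * f(w_j . x + b_j), trained by
# steepest descent with backpropagation of the (half) sum-squared error.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Training data: a simple noisy 1-D function (illustrative choice).
x = np.linspace(-2, 2, 40).reshape(-1, 1)                    # inputs, shape (N, 1)
z = np.sin(np.pi * x) + 0.05 * rng.standard_normal(x.shape)  # targets

H = 6                                   # number of sigmoidal hidden units
W = rng.normal(size=(H, 1))             # hidden weights w_j
b_h = rng.normal(size=(H,))             # hidden biases b_j
v = rng.normal(scale=0.1, size=(H,))    # output weights v_j
b_o = 0.0                               # output bias b
eta = 0.05                              # learning rate (step size)

for epoch in range(5000):
    # Forward pass
    a = x @ W.T + b_h                   # hidden activations a_j, shape (N, H)
    h = sigmoid(a)                      # hidden outputs y_j = f(a_j)
    y = h @ v + b_o                     # linear output unit
    e = y - z[:, 0]                     # error, shape (N,)

    # Backward pass: gradients of 0.5 * SSE w.r.t. each parameter
    grad_v = h.T @ e
    grad_bo = e.sum()
    delta = (e[:, None] * v) * h * (1 - h)   # error back-propagated to hidden units
    grad_W = delta.T @ x
    grad_bh = delta.sum(axis=0)

    # Steepest descent: w_new = w_old - eta * (mean) gradient
    v -= eta * grad_v / len(x)
    b_o -= eta * grad_bo / len(x)
    W -= eta * grad_W / len(x)
    b_h -= eta * grad_bh / len(x)

print("final SSE:", float(np.sum(e ** 2)))
```

As the next slide notes, different random initialisations of W, b_h and v can settle into different solutions, some of them local minima.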
Implications
• a correct structure can still get the "wrong" answer
  – dependency on initial conditions
  – might be good enough
• train / test (cross-validation) required
  – is poor behaviour due to
    • network structure?
    • initial conditions?
• an additional dimension in development

Is It a Problem?
• pros seem to outweigh cons
• good solutions often arrived at quickly
• all previous issues apply
  – sample density
  – sample distribution
  – lack of prior knowledge

How to Use
• to "generalize" a GLM: match the output unit and loss to the task (see the sketch below)
  – linear regression: curve-fitting
    • linear output + SSE
  – logistic regression: classification
    • logistic output + cross-entropy (deviance)
    • extends to multinomial, ordinal: e.g. softmax output + cross-entropy
  – Poisson regression: count data

What is Learned?
• the right thing, in a maximum-likelihood sense
• theoretical: the conditional mean of the target data, E(z|x)
  – for classification this implies the probability of class membership, P(Ci|x)
• estimated: if the estimate is good then y → E(z|x)
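As a concrete picture of the GLM pairings above, here is a small NumPy sketch of the matching output unit and loss for each case; the function names are illustrative, and the small epsilon guarding log(0) is an implementation detail, not part of the lecture.

```python
# Output-unit / loss pairings for using an MLP as a generalized GLM.
# `a` is the output-layer activation produced by the hidden layer,
# `z` the target (continuous, 0/1, or one-hot, respectively).
import numpy as np

def linear_sse(a, z):
    """Regression / curve-fitting: linear output + sum-squared error."""
    y = a                                    # identity output unit
    return y, 0.5 * np.sum((y - z) ** 2)

def logistic_cross_entropy(a, z):
    """Binary classification: logistic output + cross-entropy (deviance)."""
    y = 1.0 / (1.0 + np.exp(-a))             # y estimates P(C1 | x)
    eps = 1e-12                              # guard against log(0)
    return y, -np.sum(z * np.log(y + eps) + (1 - z) * np.log(1 - y + eps))

def softmax_cross_entropy(a, z):
    """Multinomial classification: softmax output + cross-entropy.
    `a` has one column per class, `z` is one-hot coded."""
    a = a - a.max(axis=1, keepdims=True)     # stabilise the exponentials
    y = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)
    return y, -np.sum(z * np.log(y + 1e-12))
```

In each case, minimising the loss drives the network output y towards the conditional mean E(z|x), which for the two classification pairings is the class-membership probability P(Ci|x).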
Simple 1-D Function
[Figure: Neural Network DESIGN "Function Approximation" / "Generalization" demos (Chapter 11): a logsig-linear network trained on samples of a simple 1-D target, shown with 4 and with 9 hidden neurons (S1); difficulty index 1]

More Complex 1-D Example
[Figure: the same demo on a harder target (difficulty index 9), shown with 4 and with 7 hidden neurons]

Local Solutions
[Figure: two training runs of the same 9-hidden-neuron network on the difficulty-index-9 target, converging to different local solutions]

A Dichotomy
[Figure: two-class training data plotted in (x1, x2); data courtesy of B Ripley]

Class-Conditional Probabilities
[Figure: estimated P(C1|x) and P(C2|x) surfaces over (x1, x2)]

Linear Decision Boundary
[Figure: linear boundary in (x1, x2) induced by P(C1|x) = P(C2|x)]

Class-Conditional Probabilities
[Figure: estimated P(C1|x) and P(C2|x) surfaces over (x1, x2)]

Non-Linear Decision Boundary
[Figure: the resulting non-linear decision boundary in (x1, x2)]

Class-Conditional Probabilities
[Figure: estimated P(C1|x) and P(C2|x) surfaces over (x1, x2)]

Over-fitted Decision Boundary
[Figure: an over-fitted decision boundary in (x1, x2)]

Invariances
• e.g. scale, rotation, translation
• augment the training data
• design a regularizer
  – tangent propagation (Simard et al., 1992)
• build into the MLP structure
  – convolutional networks (LeCun et al., 1989)

Multi-Modal Outputs
• assumed distribution of outputs is unimodal
• sometimes it is not
  – e.g. inverse robot kinematics
• E(z|x) not unique
• use a mixture model for the output density
  – Mixture Density Networks (Bishop, 1996)

Bayesian MLPs
• the treatment so far is deterministic: point estimates of E(z|x)
• the MLP can be used as an estimator in a Bayesian treatment
  – non-linearity → approximate solutions
• gives an approximate estimate of the spread, E((z - y)^2 | x)

NETTalk
• introduced by T Sejnowski & C Rosenberg
• the first realistic ANN demo
• English text-to-speech: non-phonetic and context sensitive
  – brave, gave, save, have (within word)
  – read, read (within sentence)
• mimicked DECTalk (rule-based)

Representing Text
• 29 "letters" (a–z plus space, comma, stop)
  – 29 binary inputs indicating the letter (coding sketched below)
• context
  – local (within word)
  – global (sentence)
• 7 letters at a time: the middle letter → AFs (articulatory features)
  – 3 letters each side give context, e.g. word boundaries
[Figure: coding example: the 7-letter window "H E _ C A T _" centred on C is presented to the MLP, which outputs /k/]

NETTalk Performance
• 1024 words of continuous, informal speech
• 1000 most common words
• best-performing MLP
  – 7 × 29 inputs, 80 hidden units, 26 outputs
  – 309 units, 18,320 weights
• 94% training accuracy
• 85% testing accuracy
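The input coding described above can be sketched in a few lines; the ordering of the 29 symbols and the example window are illustrative choices, not taken from Sejnowski & Rosenberg.

```python
# NETTalk-style input coding: each of the 7 letters in the window is a
# 1-of-29 binary vector (a-z plus space, comma, full stop), giving
# 7 x 29 = 203 inputs. Symbol ordering here is an assumption.
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", ",", "."]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_window(window: str) -> np.ndarray:
    """Encode a 7-character window as a flat 7*29 binary input vector."""
    assert len(window) == 7, "NETTalk uses a 7-letter window"
    x = np.zeros((7, len(ALPHABET)))
    for pos, ch in enumerate(window.lower()):
        x[pos, INDEX[ch]] = 1.0
    return x.ravel()                   # shape (203,)

# The middle (4th) letter is the one being pronounced; the 3 letters on
# each side supply local context, e.g. word boundaries.
window = "he cat "                     # 7 characters centred on 'c' -> /k/
x = encode_window(window)
print(x.shape, int(x.sum()))           # (203,) 7
```

With 7 × 29 = 203 inputs, 80 hidden units and 26 outputs this accounts for the 309 units quoted above, and 203 × 80 + 80 × 26 = 18,320 weights.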