Radial Basis Function Networks for Regression and Classification
April 5, 2011

Outline
1. Radial Basis Function Networks

Introduction: Classification and Regression

Given a dataset (training set) of input-target pairs D = {(x_i, t_i)}, i = 1, 2, ..., N, the learning task is to predict the target t corresponding to each new input vector x ∉ D:
1. classification: the t_i are discrete (class labels);
2. regression: the t_i are continuous values.

The underlying relation between x and t is given by t_i = f(x_i) + ε_i.

GOAL: approximate f with a parametric function y(x; w) (a polynomial, a feed-forward network, etc.).

Radial Basis Function Exact Interpolation

Consider a mapping from an input space X ⊆ R^D to a target space Y ⊆ R:

    f(x): x ∈ X ⊆ R^D → y ∈ Y ⊆ R

In general we do not know the function f(x); we only have a set of input-target pairs, the training set (TS): x_1 → y_1, ..., x_N → y_N.

The goal of exact interpolation is to find a function h(x) such that

    h(x_i) = t_i,   i = 1, ..., N

The radial basis function approach introduces a set of N basis functions (one for each data point) of the form

    φ_n(x) = φ(‖x − x_n‖)

where φ(·) is some nonlinear function and ‖x − x_n‖ denotes the Euclidean distance between the input x and the point x_n.

Form of the Basis Functions

- Gaussian: φ(x) = exp(−x² / (2σ²))
- Inverse multiquadric: φ(x) = (x² + σ²)^(−α), with α > 0
- Thin-plate spline: φ(x) = x² ln(x)
- ...

We will consider the case of Gaussian basis functions:

    φ_j(x) = exp(−‖x − μ_j‖² / (2σ_j²))

The output of the mapping is a linear combination of the basis functions:

    h(x) = Σ_{j=1}^N w_j φ_j(x)

Thus the interpolation conditions can be expressed as

    Σ_{j=1}^N w_j Φ_ij = t_i,   i = 1, 2, ..., N

and rewritten in matrix form:

    [ Φ_11  Φ_12  ...  Φ_1N ] [ w_1 ]   [ t_1 ]
    [ Φ_21  Φ_22  ...  Φ_2N ] [ w_2 ] = [ t_2 ]
    [  ...   ...  ...   ... ] [ ... ]   [ ... ]
    [ Φ_N1  Φ_N2  ...  Φ_NN ] [ w_N ]   [ t_N ]

that is, Φ w^T = t, where Φ is the N × N matrix with components Φ(i, j) = φ_j(x_i) and w, t are vectors of dimension N. If Φ is a non-singular matrix, the solution for the parameters is found simply by

    w^T = Φ^(−1) t

MATLAB EXPERIMENTS
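The MATLAB experiments themselves are not included in the slides; as a rough stand-in, the following NumPy sketch implements the exact-interpolation recipe above. The toy data, the common width σ, and all function names are assumptions made only to keep the example self-contained, not part of the original material.

    import numpy as np

    def gaussian_kernel_matrix(X, centers, sigma):
        """Phi(i, j) = exp(-||x_i - c_j||^2 / (2 sigma^2))."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    # Toy 1-D data: N = 10 points, one Gaussian basis function centred on each point.
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(10, 1))
    t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(10)

    sigma = 0.1                                    # hand-picked common width (assumption)
    Phi = gaussian_kernel_matrix(X, X, sigma)      # N x N matrix Phi(i, j) = phi_j(x_i)
    w = np.linalg.solve(Phi, t)                    # exact interpolation: solve Phi w = t

    def h(x_new):
        """Interpolant h(x) = sum_j w_j phi_j(x)."""
        return gaussian_kernel_matrix(np.atleast_2d(x_new), X, sigma) @ w

    print(np.abs(h(X) - t).max())                  # ~0: the training targets are matched exactly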
This method can be generalized to mappings with a multidimensional output space Y ⊆ R^C. In this case each input vector x_n must be mapped exactly onto an output vector t_n with components t_kn, k = 1, ..., C. The interpolation condition then becomes

    Φ W^T = T

where W is a matrix of dimension C × N and T is a matrix of dimension N × C. To find the solution we invert the matrix Φ and perform the matrix product Φ^(−1) T.

Radial Basis Function Networks

We make the following changes to the exact-interpolation model:
1. the number M of basis functions φ_1(x), ..., φ_M(x) is much smaller than N (the number of data points);
2. the centers μ_j and the widths σ_j of the basis functions are determined during the training process;
3. bias parameters are included in the linear sum:

    y_k(x) = Σ_{j=1}^M w_kj φ_j(x) + w_k0

Radial Basis Function Network Training

Two-stage training procedure:
1. unsupervised training to determine the parameters of the basis functions;
2. with the basis function parameters fixed, the weights w_kj are determined by supervised training.

Determining the Weights of the Network

Starting from y_k(x) = Σ_{j=1}^M w_kj φ_j(x) + w_k0, we can absorb the bias parameters into the weights to give

    y_k(x) = Σ_{j=0}^M w_kj φ_j(x)

where the extra basis function is constant, φ_0(x) = 1.

Consider a training set consisting of N data points and an RBF network with M basis functions (internal nodes). We can construct the matrix Φ of dimension N × (M + 1), with entries Φ_nj = φ_j(x_n):

    Φ = [ Φ_10  Φ_11  ...  Φ_1M
          Φ_20  Φ_21  ...  Φ_2M
           ...   ...  ...   ...
          Φ_N0  Φ_N1  ...  Φ_NM ]

The expression for the output of the RBF network, y_k(x) = Σ_{j=0}^M w_kj φ_j(x), can then be written for the whole training set as

    Y = Φ W^T

where Y, Φ and W are matrices of dimension N × C, N × (M + 1) and C × (M + 1) respectively.

We can find the parameters W by minimizing a suitable error function, e.g. the sum of squares

    E = (1/2) Σ_{n=1}^N Σ_{k=1}^C {y_k(x_n) − t_kn}² = (1/2) ‖Φ W^T − T‖²

The weight values are given by the solution of the linear equations (the normal equations)

    Φ^T Φ W^T = Φ^T T

whose solution is

    W^T = (Φ^T Φ)^(−1) Φ^T T

To see this, differentiate E with respect to W:

    ∂E/∂W^T = Φ^T (Φ W^T − T) = Φ^T Φ W^T − Φ^T T

Setting this derivative to zero and solving for W we obtain W^T = (Φ^T Φ)^(−1) Φ^T T.
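As a concrete illustration of this step, the sketch below builds the N × (M + 1) design matrix Φ for given Gaussian centers and widths and solves Φ W^T = T by least squares. The data, centers and widths are placeholders chosen only to make the example self-contained; they are not taken from the slides.

    import numpy as np

    def design_matrix(X, centers, sigmas):
        """N x (M+1) matrix: a constant column phi_0 = 1 plus M Gaussian basis functions."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # N x M squared distances
        Phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))
        return np.hstack([np.ones((X.shape[0], 1)), Phi])                # prepend the bias column

    # Placeholder data: N points in R^2, C = 1 output, M = 3 basis functions.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 2))
    T = np.sin(X[:, :1]) + 0.05 * rng.standard_normal((50, 1))           # N x C targets

    centers = rng.standard_normal((3, 2))     # assumed centers (in practice: from clustering)
    sigmas = np.full(3, 1.0)                  # assumed common width

    Phi = design_matrix(X, centers, sigmas)   # N x (M+1)
    # Least-squares solution of Phi W^T = T (equivalent to the normal equations above).
    W_T, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    Y = Phi @ W_T                             # network outputs on the training set, N x C
    print(W_T.shape, Y.shape)                 # (4, 1) (50, 1)

Solving with a least-squares routine (or the SVD route described next) is numerically preferable to forming (Φ^T Φ)^(−1) explicitly.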
It can be shown that the solution for W can also be found using the singular value decomposition (SVD):
1. decompose the matrix Φ = U Σ V^T;
2. compute V Σ^(−1) U^T;
3. W^T = V Σ^(−1) U^T T

where U and V are orthogonal matrices of dimension N × N and (M + 1) × (M + 1) respectively, and Σ is a matrix of dimension N × (M + 1) with the singular values on the diagonal. Note that Σ^(−1) denotes the (M + 1) × N matrix constructed as follows:

    Σ^(−1)(i, i) = 1 / Σ(i, i)   if Σ(i, i) > 0
    Σ^(−1)(i, i) = 0             otherwise

Determining the Parameters of the Basis Functions

We consider the case of Gaussian basis functions

    φ_j(x) = exp(−‖x − μ_j‖² / (2σ_j²))

so we have to determine the parameters μ_j and σ_j of each basis function.

Determining the Centers and Widths of the Basis Functions

Different approaches exist:
- subset of the data points;
- orthogonal least squares;
- clustering algorithms;
- Gaussian mixture models.

Clustering Algorithms for Selecting the Parameters of the Basis Functions

If we use a clustering algorithm for which the number of clusters is fixed in advance (e.g. K-means):
1. set the number of clusters to M and run the clustering algorithm;
2. set the centers of the basis functions equal to the cluster centers;
3. set the widths (variances) of the basis functions equal to the variances of the clusters.
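A minimal sketch of this recipe, assuming scikit-learn's KMeans is available; the data, the choice M = 3, and the particular definition of the cluster variance (mean squared distance to the cluster center) are illustrative assumptions, since the slides do not prescribe an implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 2))           # placeholder training inputs

    M = 3                                       # number of basis functions (assumed)
    km = KMeans(n_clusters=M, n_init=10, random_state=0).fit(X)

    centers = km.cluster_centers_               # mu_j: one center per basis function
    # sigma_j^2: variance of the points assigned to cluster j (one common choice)
    variances = np.array([((X[km.labels_ == j] - centers[j]) ** 2).sum(axis=1).mean()
                          for j in range(M)])
    sigmas = np.sqrt(variances)

    print(centers.shape, sigmas)                # (3, 2) and three positive widths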
How Do We Select the Model Complexity?

Choosing a very simple model (e.g. M = 1) may give rise to poor results. Choosing a very complex model may give rise to over-fitting and thus to poor generalization performance.

One technique that is often used to control over-fitting is to keep a complex model but add a penalty term to the error function that discourages the coefficients from reaching large values (regularization):

    E = (1/2) Σ_{n=1}^N Σ_{k=1}^C {y_k(x_n) − t_kn}² + (λ/2) ‖w‖²

where λ is called the regularization coefficient. The solution becomes

    W^T = (Φ^T Φ + λI)^(−1) Φ^T T

As before, the solution can be computed via the singular value decomposition Φ = U Σ V^T, with W^T = V Σ^(−1) U^T T, where Σ^(−1) now denotes the (M + 1) × N matrix with entries

    Σ^(−1)(i, i) = 1 / (Σ(i, i) + λ)   if Σ(i, i) > 0
    Σ^(−1)(i, i) = 0                   otherwise

MATLAB EXPERIMENTS

Supervised Learning

The use of unsupervised techniques to determine the basis function parameters is in general not optimal as far as the subsequent supervised training is concerned: with unsupervised techniques the placement of the basis functions takes no account of the target labels. To obtain the best results we should include the target data in the training procedure, that is, we should perform supervised training of

    y_k(x) = Σ_{j=1}^M w_kj exp(−‖x − μ_j‖² / (2σ_j²)) + w_k0

To find the parameters μ_j and σ_j we minimize the error function (e.g. sum of squares) with respect to these parameters. This can be done by differentiating the error function with respect to μ_j and σ_j and using these derivatives in the gradient descent optimization algorithm:

    ∂E/∂σ_j = Σ_n Σ_k {y_k(x_n) − t_kn} w_kj exp(−‖x_n − μ_j‖² / (2σ_j²)) ‖x_n − μ_j‖² / σ_j³

    ∂E/∂μ_ji = Σ_n Σ_k {y_k(x_n) − t_kn} w_kj exp(−‖x_n − μ_j‖² / (2σ_j²)) (x_ni − μ_ji) / σ_j²

Gradient Descent

Given a differentiable error function E depending on parameters Θ = (θ_1, ..., θ_S):
1. we begin with some initial guess for Θ (e.g. random);
2. we update the parameters by moving a small distance in Θ-space in the direction in which E decreases most rapidly (−∇_Θ E):

    θ_j^(τ+1) = θ_j^(τ) − η ∂E/∂θ_j |_{Θ^(τ)}

where η is called the learning rate and is usually taken in the range [0, 1].
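To make the update rule concrete, the sketch below performs a few gradient-descent steps on the centers and widths of a one-output Gaussian RBF model, using the derivatives written above. The toy data, the learning rate, the number of iterations, the clipping of the widths, and the initial least-squares fit of the output weights (the first stage of the two-stage procedure, kept fixed here) are all assumptions added for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 1, size=(40, 1))                      # toy 1-D inputs (assumed data)
    t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(40)

    M, eta = 4, 0.001                                        # model size, learning rate (assumptions)
    mu = np.linspace(0.1, 0.9, M).reshape(M, 1)              # initial centers
    sigma = np.full(M, 0.3)                                  # initial widths

    def forward(X, mu, sigma):
        diff = X[:, None, :] - mu[None, :, :]                # differences x_n - mu_j, N x M x D
        d2 = (diff ** 2).sum(axis=-1)                        # squared distances, N x M
        return np.exp(-d2 / (2 * sigma ** 2)), diff, d2

    # First stage: output weights by least squares (kept fixed); then fine-tune mu and sigma.
    phi, _, _ = forward(X, mu, sigma)
    coef = np.linalg.lstsq(np.hstack([np.ones((40, 1)), phi]), t, rcond=None)[0]
    w0, w = coef[0], coef[1:]

    for _ in range(200):
        phi, diff, d2 = forward(X, mu, sigma)
        err = phi @ w + w0 - t                               # y(x_n) - t_n
        # dE/dsigma_j and dE/dmu_ji, exactly as written above
        grad_sigma = (err[:, None] * w * phi * d2 / sigma ** 3).sum(axis=0)
        grad_mu = (err[:, None, None] * w[None, :, None] * phi[:, :, None]
                   * diff / sigma[None, :, None] ** 2).sum(axis=0)
        sigma = np.maximum(sigma - eta * grad_sigma, 1e-2)   # gradient step, widths kept positive
        mu = mu - eta * grad_mu

    print(0.5 * float((err ** 2).sum()))                     # sum-of-squares error after fine-tuning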
Model Complexity and PCA
April 5, 2011

Outline
1. A toy example of regression
2. Model complexity

A Toy Example: A Simple Regression Problem

We introduce some key concepts by means of a toy regression problem in which we know the true regression function, u(x) = sin(2πx). The training set comprises 10 input points {x_i}, i = 1, ..., 10, spaced uniformly in the range [0, 1], with corresponding target values t_i = u(x_i) + ε_i, where ε_i is small random noise drawn from a Gaussian distribution.

[Figure: the 10 noisy training points and the curve sin(2πx) on [0, 1].]

To approximate the "unknown" function u(x) by means of a function y(x, w) we must:
- choose the function model (linear, polynomial, neural network, ...);
- determine the parameters w of the model with a learning algorithm; this step usually involves choosing an error function which, for a given value w̃, measures the misfit between the function y(x, w̃) and the training data.

The learning algorithm is the procedure that selects the parameters w that minimize the error function.

A First Model: Polynomial

Polynomial model:

    y(x, w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M = Σ_{j=0}^M w_j x^j,   M ∈ N

Sum-of-squares error function:

    E(w) = (1/2) Σ_{n=1}^N {y(x_n, w) − t_n}²

[Figure: the error measures the displacement between y(x_n, w) and the target t_n at each training point x_n.]

Thus we have to select the value of M and then determine the values of the parameters w.

A Second Model: Neural Network

In the case of a single input and output with an identity output function we can write

    y(x, w) = Σ_{j=0}^M w_j^(2) φ_j(x),   M ∈ N,   with φ_j(x) = g(w_j^(1) x)

where g(·) is some nonlinear function (e.g. a sigmoid, or a radial basis function as in RBF networks). Again we have to select the value of M and then determine the values of the parameters w.

Determining the Parameters w

Having chosen the function model (polynomial, neural network, and so on) and the related value of M, several techniques exist to determine the parameter values w:
- maximum likelihood;
- Bayesian approach;
- ...

We indicate with w* the values of the parameters that minimize the error function for a given model (in our case identified by the value of M).

What Happens When We Change the Model Complexity?

[Figure: fits obtained for several values of M.]

The Over-fitting Problem

[Figure: training and test RMS error as a function of M.]

We measure the generalization performance on a test set using the root-mean-square error

    E_RMS = sqrt(2 E(w*) / N)

and we choose the M that gives the best generalization performance, that is, the minimum error on the test set.

What Happens to w When We Change the Model Complexity?

                M = 0     M = 1     M = 3       M = 9
    w*_0        0.19      0.28      0.31        0.35
    w*_1                 −1.27      7.99        232.37
    w*_2                           −25.43      −5321.83
    w*_3                            17.37       45868.31
    w*_4                                       −231639.30
    w*_5                                        640042.26
    w*_6                                       −1061800.52
    w*_7                                        1042400.18
    w*_8                                       −557682.99
    w*_9                                        125201.43
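The behaviour in this table can be reproduced qualitatively with a few lines of NumPy: fit polynomials of increasing degree to a noisy sin(2πx) sample and compare training and test RMS error. The random seed, noise level and test-set size are arbitrary, so the exact numbers will differ from those above.

    import numpy as np

    rng = np.random.default_rng(4)
    # Toy data re-created here; not the exact values behind the table above.
    x_train = np.linspace(0, 1, 10)
    t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
    x_test = np.linspace(0, 1, 100)
    t_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(100)

    def fit_poly(x, t, M):
        """Least-squares fit of y(x, w) = sum_j w_j x^j, j = 0..M."""
        A = np.vander(x, M + 1, increasing=True)          # design matrix with columns x^0 .. x^M
        w, *_ = np.linalg.lstsq(A, t, rcond=None)
        return w

    def rms(x, t, w):
        y = np.vander(x, len(w), increasing=True) @ w
        return np.sqrt(np.mean((y - t) ** 2))             # E_RMS = sqrt(2 E(w*) / N)

    for M in (0, 1, 3, 9):
        w = fit_poly(x_train, t_train, M)
        print(f"M={M}: train RMS={rms(x_train, t_train, w):.3f}, "
              f"test RMS={rms(x_test, t_test, w):.3f}, max|w|={np.abs(w).max():.1f}")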
How Do We Select the Model Complexity?

Choosing a very simple model (e.g. M = 1, a linear model) may give rise to poor results. Choosing a very complex model may give rise to over-fitting and thus to poor generalization performance. One technique that is often used to control over-fitting is to keep a complex model but add a penalty term to the error function that discourages the coefficients from reaching large values (regularization):

    Ẽ(w) = (1/2) Σ_{n=1}^N {y(x_n; w) − t_n}² + (λ/2) ‖w‖²

where λ is called the regularization coefficient.

Using Regularization

[Figure: fits of the complex model obtained with regularization.]

MATLAB EXPERIMENTS

Dimensionality Reduction: 2D Example

[Figures: a two-dimensional data set (x_1, x_2) with its principal directions V1 and V2; the data projected onto V1 (coordinate u_1); the data reconstructed from the one-dimensional projection, compared with the original data.]

Reconstruction error: 0.18 (RMS error).

Principal Component Analysis

Consider a data set of N observations {x_n}, n = 1, ..., N, where x_n ∈ R^d. The objective is to project the data onto a k < d dimensional space while maximizing the variance of the projected data.

For k = 1, let u ∈ R^d be the projection direction:
- u^T x_n is the projection of the n-th point;
- x̄ = (1/N) Σ_{n=1}^N x_n is the sample mean of the data;
- u^T x̄ is the projected sample mean.

The variance of the projected data is

    (1/N) Σ_{n=1}^N (u^T x_n − u^T x̄)² = u^T S u

where S is the sample covariance matrix

    S = (1/N) Σ_{n=1}^N (x_n − x̄)(x_n − x̄)^T

The optimization problem is

    arg max_u u^T S u   subject to   ‖u‖² = 1

Using a Lagrange multiplier, we maximize u^T S u + λ(1 − u^T u). Setting the derivative with respect to u equal to zero gives an eigenvalue problem:

    S u = λ u

Multiplying both sides by u^T (and noting that u^T u = 1) gives u^T S u = λ, so the variance is maximized when we choose u equal to the eigenvector of S having the largest eigenvalue λ.

PCA Algorithm

    function PCA(X, k)
        Xmean ← computeMean(X)
        Xc ← X − Xmean                        (center the data)
        CovX ← Xc^T Xc                        (compute the covariance matrix)
        [U, Lambda] ← diagonalize(CovX)       (eigenvectors, eigenvalues)
        U ← sortDescending(U, Lambda)         (sort eigenvectors by decreasing eigenvalue)
        Uk ← getFirstKcomponents(U, k)        (first k eigenvectors)
        Y ← Xc Uk                             (project the data)
        return Y
    end function
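A runnable NumPy version of this pseudocode might look as follows. The 2-D test data is an arbitrary correlated Gaussian sample, not the data set used for the figures, and the symmetric eigendecomposition (eigh) is one of several equivalent ways to diagonalize the covariance matrix.

    import numpy as np

    def pca(X, k):
        """Project the rows of X onto the k leading principal components."""
        Xc = X - X.mean(axis=0)                      # center the data
        cov = Xc.T @ Xc / X.shape[0]                 # covariance matrix (1/N does not change eigenvectors)
        eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigendecomposition (ascending order)
        order = np.argsort(eigvals)[::-1]            # sort eigenvectors by decreasing eigenvalue
        Uk = eigvecs[:, order[:k]]                   # first k eigenvectors
        return Xc @ Uk                               # projected data, N x k

    # Arbitrary correlated 2-D sample, reduced to one dimension.
    rng = np.random.default_rng(5)
    X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
    Y = pca(X, 1)
    print(Y.shape, Y.var())                          # (200, 1) and the variance captured by V1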