Radial-Basis Function Networks
CS/CMPE 333 – Neural Networks

Introduction
Typical tasks performed by neural networks are association, classification, and filtering. This categorization has historical significance as well.
These tasks involve input-output mappings, and the network is designed to learn a mapping from knowledge of the problem environment.
Thus, the design of a neural network can be viewed as a curve-fitting or function approximation problem.
This viewpoint is the motivation for radial-basis function networks.

CS/CMPE 333 - Neural Networks (Sp 2002/2003) - Asim Karim @ LUMS

Radial-Basis Function Networks
RBF networks are 2-layer networks: input source nodes, hidden neurons with (nonlinear) basis functions, and output neurons with linear or nonlinear activation functions.
The theory of radial-basis function networks is built upon function approximation theory in mathematics.
RBF networks were first used in 1988. Major work was done by Moody and Darken (1989) and Poggio and Girosi (1990).
In RBF networks, the mapping from the input space to the high-dimensional hidden space is nonlinear, while that from the hidden space to the output space is linear.
What is the basis for this?

Radial-Basis Function Network
[Figure: source nodes $x_1, \ldots, x_p$; hidden neurons with RBF activation functions $\varphi_1(\cdot), \ldots, \varphi_M(\cdot)$; weights $w$; output neurons $y_1, y_2$]

Cover's Theorem (1)
Cover's theorem (1965) on the separability of patterns gives the motivation for RBF networks.
A complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Cover's Theorem (2)
Consider a set $X$ of $N$ p-dimensional vectors (input patterns) $x_1$ to $x_N$.
Let $X^+$ and $X^-$ be a binary partition of $X$, and $\varphi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_M(x)]^T$.
A binary partition (dichotomy) $[X^+, X^-]$ of $X$ is said to be $\varphi$-separable if there exists an M-dimensional vector $w$ such that
$w^T \varphi(x) \geq 0$ when $x$ belongs to $X^+$
$w^T \varphi(x) < 0$ when $x$ belongs to $X^-$
Decision boundary or surface: $w^T \varphi(x) = 0$

Cover's Theorem (3)
[Figure]

Example (1)
Consider the XOR problem to illustrate the significance of $\varphi$-separability and Cover's theorem.
Define a pair of Gaussian hidden functions
$\varphi_1(x) = \exp(-\|x - t_1\|^2)$, $t_1 = [1, 1]^T$
$\varphi_2(x) = \exp(-\|x - t_2\|^2)$, $t_2 = [0, 0]^T$
Output of these functions for each pattern:
  Input x    $\varphi_1(x)$    $\varphi_2(x)$
  (1, 1)     1.0000            0.1353
  (0, 1)     0.3679            0.3679
  (0, 0)     0.1353            1.0000
  (1, 0)     0.3679            0.3679

Example (2)
[Figure]

Function Approximation (1)
Function approximation seeks to describe the behavior of complex functions by ensembles of simpler functions.
Describe $f(x)$ by $F(x)$.
$F(x)$ can be described in a compact region of the input space by
$F(x) = \sum_{i=1}^{N} w_i \varphi_i(x)$
such that
$|f(x) - F(x)| < \epsilon$
where $\epsilon$ can be made arbitrarily small.
Choice of function $\varphi(\cdot)$?

Function Approximation (2)
Find $F(x)$ that "best" approximates the map/function $f$.
The best approximation is problem dependent, and it can be strict interpolation or good generalization (regularized interpolation).
Design decisions:
  Choice of elementary functions $\varphi(\cdot)$
  How to compute the weights $w$?
  How many elementary functions to use (i.e. what should $N$ be)?
  How to obtain a good generalization?

Choice of Elementary Functions $\varphi$
Let $f(x)$ belong to the function space $L^2(\mathbb{R}^p)$ (true for almost all physical systems).
We want $\varphi$ to be a basis of $L^2$.
What is meant by a basis?
A set of functions $\varphi_i$ ($i = 1, \ldots, M$) is a basis of $L^2$ if linear superposition of the $\varphi_i$ can generate any function in $L^2$.
Moreover, they must be linearly independent: $w_1 \varphi_1 + w_2 \varphi_2 + \cdots + w_M \varphi_M = 0$ iff $w_i = 0$ for all $i$.
Demos from the Neural and Adaptive Systems book.

Interpolation Problem (1)
In general, the map from an input space to an output space is given by
$f: \mathbb{R}^p \to \mathbb{R}^q$
where $p$ and $q$ are the input and output space dimensions, and $f$ is the map or hypersurface.
Strict interpolation problem:
Given a set of $N$ different points $x_i$ ($i = 1, \ldots, N$) and a corresponding set of $N$ real numbers $d_i$ ($i = 1, \ldots, N$), find a function $F: \mathbb{R}^p \to \mathbb{R}^1$ that satisfies the interpolation condition
$F(x_i) = d_i$, $i = 1, \ldots, N$
The function $F$ passes through all the points.

Interpolation Problem (2)
A common type of $\varphi(\cdot)$ is the radially symmetric basis function:
$F(x) = \sum_{i=1}^{N} w_i \varphi(\|x - x_i\|)$
Substituting into the interpolation condition and writing in matrix form:
$\Phi w = d$
$\Phi = \{\varphi_{ji}\}$ ($i, j = 1, \ldots, N$) = interpolation matrix, with $\varphi_{ji} = \varphi(\|x_j - x_i\|)$
$w$ = linear weight vector; $d$ = desired response vector

Interpolation Problem (3)
$\Phi$ is known to be positive definite for a certain class of radial-basis functions. Thus,
$w = \Phi^{-1} d$
In theory, $w$ can be computed. In practice, however, $\Phi$ is often close to singular.
Then what?
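
To make the strict interpolation step concrete, here is a minimal sketch in Python with NumPy. It assumes a Gaussian basis function $\varphi(r) = \exp(-r^2 / (2\sigma^2))$; the sine data, the two values of sigma, and the helper name `gaussian_interpolation_matrix` are hypothetical choices made only for illustration. It solves $\Phi w = d$ and compares the condition number of $\Phi$ for a narrow and a wide basis function.

```python
import numpy as np

def gaussian_interpolation_matrix(points, sigma):
    """Return Phi with Phi[j, i] = exp(-||x_j - x_i||^2 / (2 sigma^2))."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Hypothetical 1-D data set: N = 10 samples of one period of a sine curve.
x = np.linspace(0.0, 1.0, 10).reshape(-1, 1)
d = np.sin(2.0 * np.pi * x[:, 0])

# Narrow Gaussians (assumed sigma): Phi is well conditioned and Phi w = d
# can be solved directly.
phi_narrow = gaussian_interpolation_matrix(x, sigma=0.1)
w = np.linalg.solve(phi_narrow, d)
residual = np.max(np.abs(phi_narrow @ w - d))  # F(x_i) - d_i at the data points

# Wide Gaussians (assumed sigma): neighbouring rows of Phi become nearly
# identical, so Phi is close to singular.
phi_wide = gaussian_interpolation_matrix(x, sigma=1.0)
cond_narrow = np.linalg.cond(phi_narrow)
cond_wide = np.linalg.cond(phi_wide)

print("max interpolation residual:", residual)
print("condition number, sigma = 0.1:", cond_narrow)
print("condition number, sigma = 1.0:", cond_wide)
```

With the narrow basis function the interpolant passes through every data point to within rounding error. With the wide one the condition number grows by many orders of magnitude, which is the near-singularity described in the slide.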
Regularization theory can be used to perturb $\Phi$ to make it nonsingular.
But there is another problem: poor generalization, or overfitting.

Ill-Posed Problems
Supervised learning is an ill-posed problem:
  There is not enough information in the training data to reconstruct the input-output mapping uniquely.
  The presence of noise or imprecision in the input data adds uncertainty to the reconstruction of the input-output mapping.
To achieve good generalization, additional information about the domain is needed.
In other words, the input-output patterns should exhibit redundancy.
Redundancy is achieved when the physical generator of the data is smooth, and thus can be used to generate redundant input-output examples.

Regularization Theory (1)
How to make an ill-posed problem well-posed?
By constraining the mapping with additional information (e.g. smoothness) in the form of a nonnegative functional.
Proposed by Tikhonov in 1963 in the context of function approximation in mathematics.

Regularization Theory (2)
Input-output examples: $(x_i, d_i)$, $i = 1, \ldots, N$
Find the mapping $F(x): \mathbb{R}^p \to \mathbb{R}^1$ for the input-output examples.
In regularization theory, $F$ is found by minimizing the cost functional $\xi(F)$:
$\xi(F) = \xi_s(F) + \lambda \xi_c(F)$
Standard error term:
$\xi_s(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i - y_i)^2 = \frac{1}{2} \sum_{i=1}^{N} (d_i - F(x_i))^2$
Regularization term:
$\xi_c(F) = \frac{1}{2} \|PF(x)\|^2$
$P$ = linear differential operator

Regularization Theory (3)
The regularization term depends on the geometric properties of the approximating function.
The selection of the operator $P$ is therefore problem dependent, based on prior knowledge of the geometric properties of the actual function $f(x)$ (e.g. the smoothness of $f(x)$).
Regularization parameter $\lambda$: a positive real number.
This parameter indicates the sufficiency of the given input-output examples in capturing the underlying function $f(x)$.
The solution of the regularization problem is a function $F(x)$.
We won't go into the details of how to find $F$, as it requires a good understanding of functional analysis.

Regularization Theory (4)
The solution of the regularization problem yields
$F(x) = \frac{1}{\lambda} \sum_{i=1}^{N} [d_i - F(x_i)] G(x; x_i) = \sum_{i=1}^{N} w_i G(x; x_i)$
where $w_i = \frac{1}{\lambda} [d_i - F(x_i)]$ and $G(x; x_i)$ is the Green's function centered on $x_i$.
In matrix form, $F = Gw$ and $\lambda w = d - F$, so
$(G + \lambda I) w = d$ and $w = (G + \lambda I)^{-1} d$
The Green's function $G$ depends only on the operator $P$.

Type of Function $G(x; x_i)$
If $P$ is translationally invariant, then $G(x; x_i)$ depends only on the difference of $x$ and $x_i$, i.e.
$G(x; x_i) = G(x - x_i)$
If $P$ is both translationally and rotationally invariant, then $G(x; x_i)$ depends only on the Euclidean norm of the difference vector $x - x_i$, i.e.
$G(x; x_i) = G(\|x - x_i\|)$
This is a radial-basis function.
If $P$ is further constrained, and $G(x; x_i)$ is positive definite, then we have the Gaussian radial-basis function, i.e.
$G(x; x_i) = \exp\left(-\frac{1}{2\sigma^2} \|x - x_i\|^2\right)$

Regularization Network (1)
[Figure]

Regularization Network (2)
The regularization network is based on the regularized interpolation problem
$F(x) = \sum_{i=1}^{N} w_i G(x; x_i)$
It has 3 layers:
  Input layer of $p$ source nodes, where $p$ is the dimension of the input vector $x$ (or the number of independent variables)
  Hidden layer with $N$ neurons, where $N$ is the number of input-output examples.
  Each hidden neuron uses the activation function $G(x; x_i)$.
  Output layer with $q$ neurons, where $q$ is the output dimension
The only unknowns are the weights $w$ from the hidden layer to the output layer.

RBF Networks (in Practice) (1)
The regularization network requires $N$ hidden neurons, which becomes computationally expensive for large $N$.
The complexity of the network is reduced to obtain an approximate solution to the regularization problem.
The approximate solution $F^*(x)$ is then given by
$F^*(x) = \sum_{i=1}^{M} w_i \varphi_i(x)$
where $\varphi_i(x)$ ($i = 1, \ldots, M$) is a new set of basis functions; $M$ is typically less than $N$.
Using radial-basis functions:
$F^*(x) = \sum_{i=1}^{M} w_i \varphi(\|x - t_i\|)$

RBF Networks (2)
[Figure]

RBF Networks (3)
Unknowns in the RBF network:
  $M$, the number of hidden neurons ($M < N$)
  The centers $t_i$ of the radial-basis functions
  The weights $w$

How to Train RBF Networks - Learning
Normally the training of the hidden layer parameters (number of hidden neurons, centers, and variance of the Gaussians) is done prior to the training of the weights (i.e. on a different 'time scale').
This is justified by the fact that the hidden layer performs a different task (nonlinear) than the output layer (linear).
The weights are learned by supervised learning using an appropriate algorithm (LMS or BP).
The hidden layer parameters are learned (in general, but not always) by unsupervised learning.

Fixed Centers Selected at Random
Randomly select $M$ inputs as centers for the activation functions.
Fix the variance of the Gaussians based on the distance between the selected centers. A radial-basis function centered at $t_i$ is then given by
$\varphi(\|x - t_i\|) = \exp\left(-\frac{M}{d^2} \|x - t_i\|^2\right)$
$d$ = maximum distance between the chosen centers
The 'width' or standard deviation of the functions is fixed, given by
$\sigma = d / \sqrt{2M}$
The linear weights are then computed by solving the regularization problem or by using supervised learning.

Self-Organized Selection of Centers
Use a self-organizing or clustering technique to determine the number and centers of the Gaussian functions.
A common algorithm is the k-means algorithm. This algorithm assigns each input vector $x$ to the nearest of $k$ cluster centers, then moves each center to the mean of the vectors assigned to it, and repeats until the assignments stop changing.
Then compute the weights using supervised error-correction learning such as LMS.

Supervised Selection of Centers
All unknown parameters are trained using error-correcting supervised learning.
A gradient descent approach is used to find the minimum of the cost function with respect to the weights $w_i$, the activation function centers $t_i$, and the spreads $\sigma$ of the centers.

Example (1)
Classify between two 'overlapping' two-dimensional, Gaussian-distributed patterns.
Conditional probability density functions for the two classes:
$f(x | C_1) = \frac{1}{2\pi\sigma_1^2} \exp\left[-\frac{1}{2\sigma_1^2} \|x - \mu_1\|^2\right]$, with mean $\mu_1 = [0, 0]^T$ and variance $\sigma_1^2 = 1$
$f(x | C_2) = \frac{1}{2\pi\sigma_2^2} \exp\left[-\frac{1}{2\sigma_2^2} \|x - \mu_2\|^2\right]$, with mean $\mu_2 = [2, 0]^T$ and variance $\sigma_2^2 = 4$
$x = [x_1, x_2]^T$ = two-dimensional input
$C_1$ and $C_2$ = class labels

Example (2)
[Figure]

Example (3)
[Figure]

Example (4)
Consider an RBF network with two inputs, $M$ hidden neurons, and two outputs.
Decision rule: an input $x$ is classified to $C_1$ if $y_1 \geq 0$.
The training set is generated from the probability distribution functions.
Using the perceptron algorithm, the network is trained for minimum mean-square error.
The testing set is generated from the probability distribution functions.
The trained network is tested for correct classification.
For other implementation details, see the Matlab code.

Example: Function Approximation (1)
Approximate the relationship between a car's fuel economy (in miles per gallon) and its characteristics.
Input data description: 9 independent discrete-valued, boolean, and continuous variables
  X1: number of cylinders
  X2: displacement
  X3: horsepower
  X4: weight
  X5: acceleration
  X6: model year
  X7: Made in US? (0,1)
  X8: Made in Europe? (0,1)
  X9: Made in Japan? (0,1)
Output f(X) is the fuel economy in miles per gallon.

Example: Function Approximation (2)
Using the NNET toolbox, create and train an RBF network with the function newrb.
The function parameters allow you to set the mean-squared-error goal of the training, the spread of the radial-basis functions, and the maximum number of hidden layer neurons.
newrb uses the following approach to find the unknowns (it is a self-organizing approach):
  Start with one hidden neuron; compute the network error.
  Add another neuron with its center equal to the input vector that produced the maximum error; compute the network error.
  If the network error does not improve significantly, stop; otherwise go to the previous step and add another neuron.

Comparison of RBF Network and MLP (1)
Both are universal approximators.
Thus, an RBF network exists for every MLP, and vice versa.
An RBF network has a single hidden layer, while an MLP can have multiple hidden layers.
The computational neurons of an MLP all share the same model, while the neurons in the hidden and output layers of an RBF network have different models.
The activation function of each hidden node of an RBF network is based on the Euclidean norm of the input with respect to a center, while that of an MLP is based on the inner product of the input and the weights.

Comparison of RBF Network and MLP (2)
MLPs construct global approximations to nonlinear input-output mappings. This is a consequence of the global activation function (sigmoidal) used in MLPs.
As a result, MLPs can perform generalization in regions where input data is not available (i.e. extrapolation).
RBF networks construct local approximations to input-output data. This is a consequence of the local Gaussian functions.
As a result, RBF networks are capable of fast learning from the training data.
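
The local character of an RBF network can be seen in a small experiment. The sketch below, in Python with NumPy, uses the fixed-centers recipe from these notes ($M$ centers picked at random from the inputs, width $\sigma = d / \sqrt{2M}$) and evaluates the trained network far from the training data. The sine data, the sizes, the random seed, and the helper name `rbf_design_matrix` are hypothetical, and the output weights are fitted by ordinary least squares as a stand-in for LMS training, so this is an illustration under those assumptions and not the course's Matlab code.

```python
import numpy as np

def rbf_design_matrix(inputs, centers, sigma):
    """Hidden-layer outputs exp(-||x - t_i||^2 / (2 sigma^2)), one column per center."""
    diffs = inputs[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)  # assumed seed, for repeatability only

# Hypothetical training data: samples of sin(x) on [0, 2*pi].
n_samples, n_centers = 100, 10
x_train = np.linspace(0.0, 2.0 * np.pi, n_samples).reshape(-1, 1)
d_train = np.sin(x_train[:, 0])

# Fixed centers selected at random from the training inputs.
center_idx = rng.choice(n_samples, size=n_centers, replace=False)
centers = x_train[center_idx]

# Width from the maximum distance d between the chosen centers.
center_dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
d_max = center_dists.max()
sigma = d_max / np.sqrt(2.0 * n_centers)

# Linear output weights by least squares (swapped in here for LMS).
hidden_train = rbf_design_matrix(x_train, centers, sigma)
weights, *_ = np.linalg.lstsq(hidden_train, d_train, rcond=None)
train_rmse = np.sqrt(np.mean((hidden_train @ weights - d_train) ** 2))

# Far from every center all Gaussians are essentially zero, so the output is too.
x_far = np.array([[50.0]])
y_far = rbf_design_matrix(x_far, centers, sigma) @ weights

print("training RMSE:", train_rmse)
print("network output at x = 50:", y_far[0])
```

Because every hidden unit responds only near its own center, the network output collapses to zero outside the region covered by the centers, whereas a sigmoidal MLP would keep producing a nonzero (extrapolated) value there.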