Neural networks (NN) and
Multivariate Adaptive Regression Splines (MARS)
• Different types of neural networks
• Considerations in neural network modelling
• Multivariate Adaptive Regression Splines

Data mining and statistical learning, lecture 12
Feed-forward neural network
• Input layer
• Hidden layer(s)
• Output layer

[Figure: feed-forward network with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
Terminology
• Feed-forward network
  – Nodes in one layer are connected to the nodes in the next layer
• Recurrent network
  – Nodes in one layer may be connected to nodes in a previous layer or within the same layer
Multilayer perceptrons
• Any number of inputs
• Any number of outputs
• One or more hidden layers with any number of units
• Linear combinations of the outputs from one layer form inputs to the following layers
• Sigmoid activation functions in the hidden layers

[Figure: multilayer perceptron with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
Parameters in a multilayer perceptron
$z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$   (combination function C1 inside the activation $\sigma$)

$f_k = g_k(\beta_{0k} + \beta_k^T Z), \quad k = 1, \dots, K$   (combination function C2 inside the activation $g_k$)

• C1, C2: combination functions
• $\sigma$, $g_k$: activation functions
• $\alpha_{0m}$, $\beta_{0k}$: biases of the hidden and output units
• $\alpha_{im}$, $\beta_{jk}$: weights of the connections

[Figure: network diagram with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
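To make the notation concrete, here is a minimal NumPy sketch of one forward pass through a single-hidden-layer perceptron with sigmoid hidden units and identity output activation; the array names and the dimensions (p = 3, M = 4, K = 2) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, alpha0, alpha, beta0, beta):
    """One forward pass through a single-hidden-layer perceptron.

    x:      input vector of length p
    alpha0: biases alpha_0m of the M hidden units
    alpha:  (p, M) array of weights alpha_im (input i -> hidden unit m)
    beta0:  biases beta_0k of the K output units
    beta:   (M, K) array of weights beta_jk (hidden unit j -> output k)
    """
    z = sigmoid(alpha0 + x @ alpha)   # z_m = sigma(alpha_0m + alpha_m^T x)
    f = beta0 + z @ beta              # f_k = g_k(beta_0k + beta_k^T z) with identity g_k
    return z, f

# Toy dimensions: p = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
z, f = mlp_forward(rng.normal(size=3),
                   alpha0=rng.normal(size=4), alpha=rng.normal(size=(3, 4)),
                   beta0=rng.normal(size=2), beta=rng.normal(size=(4, 2)))
print("hidden units:", z)
print("outputs:", f)
```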
Least squares fitting of neural networks
Consider a simple perceptron (no hidden layer)
$f_k = \sigma(\alpha_{0k} + \alpha_k^T X) = \sigma(\alpha_{0k} + \alpha_{1k} X_1 + \dots + \alpha_{pk} X_p), \quad k = 1, \dots, K$

Find the weights and biases minimizing the error function

$R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \left( f_k(x_i) - y_{ik} \right)^2$

[Figure: simple perceptron with inputs x1, …, xp and outputs f1, …, fK]
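As an illustration of the least squares criterion, the following is a minimal sketch that fits a simple perceptron (no hidden layer, sigmoid outputs) by gradient descent on R(θ); the toy data, learning rate and iteration count are invented for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(1)
N, p, K = 200, 3, 2
X = rng.normal(size=(N, p))                         # inputs x_i
Y = rng.integers(0, 2, size=(N, K)).astype(float)   # illustrative 0/1 targets y_ik

alpha0 = np.zeros(K)        # biases alpha_0k
alpha = np.zeros((p, K))    # weight vectors alpha_k, one column per output
lr = 0.05                   # assumed learning rate

for _ in range(500):
    F = sigmoid(alpha0 + X @ alpha)      # f_k(x_i) for all observations and outputs
    grad = (F - Y) * F * (1.0 - F)       # chain rule through the sigmoid (up to a factor 2)
    alpha -= lr * X.T @ grad / N
    alpha0 -= lr * grad.sum(axis=0) / N

R = np.sum((sigmoid(alpha0 + X @ alpha) - Y) ** 2)   # error function R(theta)
print("sum-of-squared errors:", R)
```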
Alternative measures of fit
• For regression we normally use the sum-of-squared errors as the measure of fit:
  $R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \left( f_k(x_i) - y_{ik} \right)^2$
• For classification we use either squared errors or the cross-entropy (deviance):
  $R(\theta) = -\sum_{k=1}^{K} \sum_{i=1}^{N} y_{ik} \log f_k(x_i)$
  and the corresponding classifier is argmax_k f_k(x)
• The measure of fit can also be adapted to specific distributions, such as the Poisson distribution
Combination and activation functions
• Combination function
  – Linear combination: $\alpha_{0m} + \sum_j \alpha_{jm} x_j$
  – Radial combination: $-\alpha_{0m}^{2} \sum_j (\alpha_{jm} - x_j)^2$
• Activation function in the hidden layer
  – Identity
  – Sigmoid
• Activation function in the output layer
  – Softmax: $g_k(T) = \dfrac{\exp(T_k)}{\sum_{l=1}^{K} \exp(T_l)}$, where $T_k = \beta_{0k} + \beta_k^T z$
  – Identity
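A small sketch of the softmax activation g_k(T); the max-subtraction is a standard numerical-stability trick and is an implementation detail rather than part of the formula on the slide.

```python
import numpy as np

def softmax(T):
    """g_k(T) = exp(T_k) / sum_l exp(T_l)."""
    T = T - np.max(T)          # subtracting the maximum leaves the ratios unchanged
    e = np.exp(T)
    return e / e.sum()

# Illustrative values of T_k = beta_0k + beta_k^T z for K = 3 output units
print(softmax(np.array([2.0, 1.0, -0.5])))   # three probabilities summing to 1
```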
Ordinary radial basis function networks (ORBF)
• Input and output layers and one hidden layer
• Hidden layer:
  – Combination function = radial
  – Activation function = exponential, softmax
• Output layer:
  – Combination function = linear
  – Activation function = any, normally the identity

[Figure: ordinary radial basis function network with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
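The slides do not spell out the hidden-unit formula, so the sketch below assumes the common Gaussian form: a radial combination $-b_m^2 \lVert x - c_m \rVert^2$ followed by an exponential activation and softmax normalization over the hidden units; the centres, widths and input value are illustrative.

```python
import numpy as np

def orbf_hidden(x, centers, widths):
    """Hidden layer of an ordinary RBF network (sketch, assuming Gaussian units).

    Radial combination: eta_m = -b_m^2 * ||x - c_m||^2
    Activation: exponential, normalized over the hidden units (softmax).
    """
    eta = -(widths ** 2) * np.sum((centers - x) ** 2, axis=1)
    z = np.exp(eta - np.max(eta))    # exponential activation, numerically stabilized
    return z / z.sum()               # softmax normalization

centers = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # illustrative centres c_m
widths = np.array([1.0, 2.0, 0.5])                         # illustrative widths b_m
print(orbf_hidden(np.array([0.2, 0.8]), centers, widths))
```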
Issues in neural network modelling
• Preliminary training – learning with different initial weights
(since multiple local minima are possible)
• Scaling of the inputs is important (standardization)
• The number of nodes in the hidden layer(s)
• The choice of activation function in the output layer
– Interval – identity
– Nominal – softmax
Overcoming over-fitting
1. Early stopping
2. Adding a penalty function
Objective function = Error function + Penalty term, with

Penalty term $= \lambda \left( \sum_{i,m} \alpha_{im}^{2} + \sum_{m,k} \beta_{mk}^{2} \right)$
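A minimal sketch of the penalized objective (weight decay), assuming the penalty sums the squared connection weights α_im and β_mk and leaves the bias terms unpenalized; the value of λ and the weight arrays are illustrative.

```python
import numpy as np

def penalized_objective(error, alpha, beta, lam):
    """Objective function = Error function + Penalty term.

    Penalty term = lam * (sum_im alpha_im^2 + sum_mk beta_mk^2), i.e. weight decay;
    the bias terms are customarily left out of the penalty.
    """
    return error + lam * (np.sum(alpha ** 2) + np.sum(beta ** 2))

rng = np.random.default_rng(2)
alpha = rng.normal(size=(3, 4))   # weights alpha_im (inputs -> hidden units)
beta = rng.normal(size=(4, 2))    # weights beta_mk (hidden units -> outputs)
print(penalized_objective(error=12.3, alpha=alpha, beta=beta, lam=0.01))
```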
MARS: Multivariate Adaptive Regression Splines
An adaptive procedure for regression that can be
regarded as a generalization of stepwise linear
regression
Reflected pair of functions
with a knot at the value x1
[Figure: the reflected pair (x − x1)+ and (x1 − x)+ plotted against x, with a knot at x1]
Reflected pairs of functions
with knots at the values x1 and x2
[Figure: the reflected pairs (x − x1)+, (x1 − x)+ and (x − x2)+, (x2 − x)+ plotted against x, with knots at x1 and x2]
MARS with a single input X
taking the values x1, …, xN
Form the collection
$C = \left\{ (X - t)_+, \; (t - X)_+ \; ; \; t \in \{x_1, x_2, \dots, x_N\} \right\}$

of basis functions.

Construct models of the form

$f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$

where each $h_m(X)$ is a function in C or a product of two or more such functions.
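A short sketch of the reflected-pair basis functions and of evaluating a MARS-type model f(X); the knots and coefficients are invented for illustration, and the forward selection of terms is not shown.

```python
import numpy as np

def pos(u):
    """Positive part, u_+ = max(u, 0)."""
    return np.maximum(u, 0.0)

# Collection C of reflected pairs with knots at the observed values x_1, ..., x_N
x_obs = np.array([0.2, 0.5, 0.8])
C = [h for t in x_obs for h in ((lambda x, t=t: pos(x - t)),    # (X - t)_+
                                (lambda x, t=t: pos(t - x)))]   # (t - X)_+
print([h(0.6) for h in C])

# A MARS-type model f(X) = beta_0 + sum_m beta_m h_m(X) with two selected terms
def f(x):
    h1 = pos(x - 0.2)                    # (X - x_1)_+
    h2 = pos(0.5 - x)                    # (x_2 - X)_+
    return 1.0 + 2.0 * h1 - 3.0 * h2     # illustrative coefficients beta_0, beta_1, beta_2

print(f(np.linspace(0.0, 1.0, 5)))
```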
MARS model with a single input X
taking the values x1, x2
[Figure: expected response E(Y) as a piecewise-linear function of X, with knots at x1 and x2]
MARS: Multivariate Adaptive Regression Splines
At each stage we consider as a new basis
function pair all products of functions already in
the model with one of the reflected pairs in the
set C
Although each basis function depends only on a single Xj, it is considered as a function over the entire input space.
MARS: Multivariate Adaptive Regression Splines
- model selection
MARS functions typically overfit the data and so a
backward deletion procedure is applied
The size of the model is determined by
Generalized Cross Validation
An upper limit can be set on the order of
interaction
The MARS model can be viewed as a generalization
of the classification and regression tree (CART)
[Figure: partition of the (x1, x2) input space]
Some characteristics of different learning methods
Characteristic                                        Neural networks   Trees   MARS
Natural handling of data of “mixed” type              Poor              Good    Good
Handling of missing values                            Poor              Good    Good
Robustness to outliers in input space                 Poor              Good    Poor
Insensitive to monotone transformations of inputs     Poor              Good    Poor
Computational scalability (large N)                   Poor              Good    Good
Ability to deal with irrelevant inputs                Poor              Good    Good
Ability to extract linear combinations of features    Good              Poor    Poor
Interpretability                                      Poor              Fair    Good
Predictive power                                      Good              Poor    Fair
Separating hyperplane
Decision border: $\{\, x : x^T\beta + \beta_0 = 0 \,\}$

[Figure: two groups of points in the (x1, x2) plane separated by the hyperplane $x^T\beta + \beta_0 = 0$]
Optimal separating hyperplane
- support vector classifier
$\{\, x : x^T\beta + \beta_0 = 0 \,\}$

Find the hyperplane that creates the biggest margin between the training points for class 1 and class -1.

[Figure: the two classes, the optimal separating hyperplane, and the margin]
Formulation
of the optimization problem
$\max_{\beta,\, \beta_0,\, \|\beta\| = 1} \; C$

subject to $y_i (x_i^T \beta + \beta_0) \ge C, \quad i = 1, \dots, N$

Here $x_i^T \beta + \beta_0$ is the signed distance to the decision border, y = 1 for one of the groups and y = -1 for the other one.
Two equivalent formulations
of the optimization problem
$\max_{\beta,\, \beta_0,\, \|\beta\| = 1} \; C$  subject to  $y_i (x_i^T \beta + \beta_0) \ge C, \quad i = 1, \dots, N$

$\min_{\beta,\, \beta_0} \; \|\beta\|$  subject to  $y_i (x_i^T \beta + \beta_0) \ge 1, \quad i = 1, \dots, N$
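In practice the optimization problem is solved by standard software; below is a minimal sketch using scikit-learn's SVC with a linear kernel on toy data. Note that the parameter C in SVC is a cost on constraint violations, not the margin C in the formulation above, and the two classes here are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Two illustrative classes labelled +1 and -1, as in the formulation above
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(20, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)    # linear support vector classifier
beta, beta0 = clf.coef_[0], clf.intercept_[0]  # hyperplane x^T beta + beta_0 = 0
print("beta:", beta, "beta_0:", beta0)
print("number of support vectors:", len(clf.support_vectors_))
```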
Characteristics of the support vector classifier
Points well inside their class boundary do not play a big
role in the shaping of the decision border
Cf. linear discriminant analysis (LDA) for which the
decision boundary is determined by the covariance matrix
of the class distributions and their centroids
Support vector machines
using basis expansions (polynomials, splines)
$f(x) = h(x)^T \beta + \beta_0$

[Figure: decision boundary f(x) = 0 in the space of the transformed features h1(x) and h2(x)]
Characteristics of support vector machines
The dimension of the enlarged feature space can be very
large
Overfitting is prevented by a built-in shrinkage of beta
coefficients
Irrelevant inputs can create serious problems
The SVM as a penalization method
Misclassification: f(x) < 0 when y=1 or f(x)>0 when y=-1
Loss function:

$\sum_{i=1}^{N} \left[ 1 - y_i f(x_i) \right]_+$

Loss function + penalty:

$\sum_{i=1}^{N} \left[ 1 - y_i f(x_i) \right]_+ + \lambda \|\beta\|^2$
The SVM as a penalization method
Minimizing the loss function + penalty

$\sum_{i=1}^{N} \left[ 1 - y_i f(x_i) \right]_+ + \lambda \|\beta\|^2$

is equivalent to fitting a support vector machine to data.

The penalty factor λ is a function of the constant providing an upper bound of

$\sum_{i=1}^{N} \xi_i$

where the ξ_i are the slack variables of the support vector classifier.
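A minimal sketch of the penalized hinge-loss criterion for a linear classifier f(x) = xᵀβ + β₀; the data, coefficients and λ are illustrative.

```python
import numpy as np

def penalized_hinge_loss(beta, beta0, X, y, lam):
    """sum_i [1 - y_i f(x_i)]_+  +  lam * ||beta||^2 for the linear f(x) = x^T beta + beta_0."""
    f = X @ beta + beta0
    hinge = np.maximum(0.0, 1.0 - y * f)       # [1 - y_i f(x_i)]_+ , zero for points well inside
    return hinge.sum() + lam * np.sum(beta ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # illustrative +1 / -1 labels
print(penalized_hinge_loss(beta=np.array([1.0, 1.0]), beta0=0.0, X=X, y=y, lam=0.1))
```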
Some characteristics of different learning methods
Characteristic                                        Neural networks   Support vector machines   Trees   MARS
Natural handling of data of “mixed” type              Poor              Poor                      Good    Good
Handling of missing values                            Poor              Poor                      Good    Good
Robustness to outliers in input space                 Poor              Poor                      Good    Poor
Insensitive to monotone transformations of inputs     Poor              Poor                      Good    Poor
Computational scalability (large N)                   Poor              Poor                      Good    Good
Ability to deal with irrelevant inputs                Poor              Poor                      Good    Good
Ability to extract linear combinations of features    Good              Good                      Poor    Poor
Interpretability                                      Poor              Poor                      Fair    Good
Predictive power                                      Good              Good                      Poor    Fair