Optimization (supervised learning)

What is computer vision?
• The search for fundamental visual features, and the two fundamental applications: reconstruction and recognition
– Features
– Recognition
– Reconstruction
2012
• The current mania and euphoria of the AI revolution
– 2012, the annual ImageNet challenge: an improvement of about 10 points (from 75 to 85)
• Computer vision researchers use machine learning techniques to recognize objects in large amounts of images
– These techniques go back to 1998 (a decade and a half earlier!)
• Text (hand-written and printed) is actually visual!
– So why did it take so long?
• A silent hardware revolution: the GPU
• Sadly, driven by video gaming
– Nvidia (a GPU maker) is now in the driving seat of this AI revolution!
• 2016, AlphaGo beats professionals
• A narrow AI program
• Re-shaping AI, Computer Science, digital revolution …
Visual matching, and recognition for
understanding
• Finding the visually similar things in different images
--- Visual similarities
• Visual matching, find the ‘same’ thing under different
viewpoints, better defined, no semantics per se.
• Visual recognition, find the pre-trained ‘labels’,
semantics
– We define ‘labels’, then ‘learn’ from labeled data, and finally classify new inputs into these ‘labels’
The state-of-the-art of visual classification
and recognition
• Anything you can clearly define and label
• Then show a few thousand labeled examples of this thing to the computer
• The computer then recognizes a new, previously unseen image as well as humans, or even better!
• This is done by deep neural networks.
References
• CNN for Visual Recognition, Stanford, http://cs231n.github.io/neural-networks-1/
• Deep Learning Tutorial (LeNet), Montreal, http://www.deeplearning.net/tutorial/mlp.html
• Pattern Recognition and Machine Learning, Bishop
• Sparse and Redundant Representations, Elad
• Pattern Recognition and Neural Networks, Ripley
• Pattern Classification, Duda and Hart, various editions
• A Wavelet Tour of Signal Processing: The Sparse Way, Mallat
• Introduction to Applied Mathematics, Strang
Some figures and text in these slides are taken from these references.
Classification and recognition
• What is it (which class), for the input x?
– Make a decision, either by probability a > b, or by the classification surface f(x) > 0 or < 0
– Forward inference
• How to compute?
– Estimate the classification surface f(x) = 0: a (nonlinear and high-dimensional) optimization problem (often of a differentiable log-likelihood)
– Backward learning
• What to minimize?
– Justification, often probabilistic, and Bayesian
• A (parameterized) score function mapping the data
to class score, forward inference, modeling
• A loss function (objective) measuring the quality of a
particular set of parameters based on the ground
truth labels
• Optimization, minimize the loss over the parameters
with a regularization, backward learning
The dataset of pairs (x, y) is given and fixed. The weights start out as random numbers and can change. During the forward pass the score function computes class scores, stored in the vector f. The loss function contains two components: the data loss computes the compatibility between the scores f and the labels y, while the regularization loss is only a function of the weights. During gradient descent, we compute the gradient on the weights (and optionally on the data if we wish) and use it to perform a parameter update.
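The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the slides' exact setup: the shapes, the softmax data loss, the L2 regularization strength, and the step size are all assumed choices.

```python
import numpy as np

# Illustrative shapes and values (not from the slides): N examples, D features, C classes.
N, D, C = 100, 32, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))          # the fixed dataset of inputs x
y = rng.integers(0, C, size=N)           # the fixed ground-truth labels
W = 0.01 * rng.standard_normal((D, C))   # the weights start out as random numbers

def loss_and_grad(W, X, y, reg=1e-3):
    """Forward: class scores f and total loss (data + regularization); backward: gradient on W."""
    n = X.shape[0]
    f = X @ W                                              # score function: class scores f
    f = f - f.max(axis=1, keepdims=True)                   # shift for numerical stability
    p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)   # softmax probabilities
    data_loss = -np.log(p[np.arange(n), y]).mean()         # compatibility of scores f with labels y
    reg_loss = 0.5 * reg * np.sum(W * W)                   # a function of the weights only
    dp = p.copy()
    dp[np.arange(n), y] -= 1.0                             # gradient of the data loss w.r.t. f
    dW = X.T @ dp / n + reg * W                            # gradient on the weights
    return data_loss + reg_loss, dW

step_size = 1e-1
for _ in range(200):                      # gradient descent: repeated parameter updates
    loss, dW = loss_and_grad(W, X, y)
    W -= step_size * dW
```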
Bayesian decision
• P(ω_j | x) = P(x | ω_j) P(ω_j) / P(x)
• posterior = likelihood × prior / evidence
• Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2
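As a minimal sketch of this two-class decision rule, the snippet below assumes made-up 1-D Gaussian class-conditional likelihoods and priors; none of the numbers come from the slides.

```python
import numpy as np

priors = np.array([0.6, 0.4])   # P(w1), P(w2), assumed values
means = np.array([0.0, 2.0])    # class-conditional Gaussian means (assumed)
sigmas = np.array([1.0, 1.0])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def decide(x):
    likelihoods = gaussian_pdf(x, means, sigmas)   # P(x | w_j)
    evidence = np.sum(likelihoods * priors)        # P(x)
    posteriors = likelihoods * priors / evidence   # P(w_j | x)
    return np.argmax(posteriors) + 1               # decide w1 if P(w1|x) > P(w2|x), else w2

print(decide(0.3))   # -> 1 (closer to class 1's mean, and class 1 has the higher prior)
print(decide(1.8))   # -> 2
```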
Optimization (supervised learning)
• Minimize a loss function
• The number of errors: the zero-one loss
• The zero-one loss is not differentiable, so we maximize the log-likelihood, or equivalently minimize the negative log-likelihood
• We use the gradient of this function
• Stochastic gradient descent uses a few examples at a time instead of the entire training set
• The loss function should be regularized (against ill-posedness / non-unique solutions, as a smoothness constraint, or to avoid overfitting)
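Continuing the NumPy sketch from the forward/backward example above (the loss_and_grad helper there is an illustrative assumption, not a fixed API), stochastic gradient descent only changes which examples each update sees:

```python
# Stochastic gradient descent: each update uses a small random minibatch
# instead of the entire training set (batch size and step size are assumed values).
batch_size, step_size, num_steps = 16, 1e-1, 1000
for step in range(num_steps):
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    loss, dW = loss_and_grad(W, X[idx], y[idx], reg=1e-3)   # regularized minibatch loss
    W -= step_size * dW                                     # one parameter update per minibatch
```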
Optimization (supervised learning)
• Training, validation, and testing data
• Hyper-parameters
• Overfitting to the data
• Generalization
• Regularization
Fundamental linear classifiers
• Binary linear classifier, y = f(x) = w ⋅ x + b
• The classification surface is a hyper-plane, f(x) = 0
• Geometry: 3-D and n-D
• Linear algebra: linear spaces
• Linear classifiers
• The decision is a nonlinear thresholding
• A nonlinear distance function, or a probability-like sigmoid
• A single neuron is a linear classifier
• w ⋅ x + b: a linear classifier, a neuron
– It is a dot product of two vectors, a scalar product
– A template matching, a correlation, between the template w and the input vector x
– Also an algebraic distance, not the geometric one, which is nonlinear (therefore the solution is usually nonlinear!)
• The dot product acts as a similarity measure between two points: one is the data, the other a representative (the template)
– ‘Linear’ means that the decision surface is linear, a hyper-plane. The solution, i.e. the training, is usually not linear at all; it depends on the loss function (softmax or SVM) and is solved iteratively by numerical gradient descent
A biological neuron and its mathematical model.
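A minimal sketch of a single neuron as a linear classifier: a dot product w ⋅ x + b (a template match / correlation), followed by a sigmoid that makes the score probability-like; the weights and input below are made-up numbers.

```python
import numpy as np

w = np.array([0.5, -1.2, 0.3])   # the template: the weights of one neuron (assumed values)
b = 0.1
x = np.array([1.0, 0.2, -0.7])   # one input vector

score = np.dot(w, x) + b              # linear score: a dot product / correlation with the template
prob = 1.0 / (1.0 + np.exp(-score))   # sigmoid turns the score into a probability-like value
decision = 1 if score > 0 else 0      # the decision itself is a nonlinear threshold, f(x) > 0
```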
From two to N classes
• Binary linear classifier, y = f(x) = w ⋅ x + b
• The classification surface is a hyper-plane, f(x) = 0
• Multi-class: output a vector function, y = f(x) = f(W x + b)
• The normalized exponentials (softmax), s(f(x)) = (s ∘ f)(x)
• (s is a kind of normalization)
• W x + b: each row of W is a linear classifier, a neuron
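A small sketch of the multi-class case: each row of W is one linear classifier, and the softmax s normalizes the score vector into class probabilities. The shapes (10 classes, a 3072-dimensional flattened image) are assumptions for illustration.

```python
import numpy as np

def softmax(scores):
    # Normalized exponentials; subtracting the max is a standard numerical-stability trick.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

W = np.random.randn(10, 3072) * 0.01   # 10 classes, e.g. a 32*32*3 = 3072-pixel input
b = np.zeros(10)
x = np.random.randn(3072)              # one flattened input image

scores = W @ x + b                     # f(x) = W x + b: one score per class, one neuron per row
probs = softmax(scores)                # s(f(x)): class probabilities summing to 1
```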
Is a linear classifier straightforward?
• Only the inference ‘scoring’ function is linear
• There are no ‘analytical’ (closed-form) solutions of the loss functions
• The conditions are inequalities, not equalities
The two common linear classifiers, with
different loss functions
• SVM, uncalibrated score
• Softmax, multi-class logistic regression, a normalized
class probability for each label
• They are usually comparable
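To make the two loss functions concrete, here is a per-example sketch of the multi-class SVM (hinge) loss and the softmax (cross-entropy) loss on the same score vector; the margin of 1.0 follows the usual convention and the scores are made up.

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # class scores f(x) for one example (made-up numbers)
y = 0                                  # index of the correct class

# Multi-class SVM loss: sum of the margins by which wrong classes beat the correct one + 1.
margins = np.maximum(0, scores - scores[y] + 1.0)
margins[y] = 0
svm_loss = margins.sum()               # 2.9 for these numbers

# Softmax (cross-entropy) loss: negative log of the normalized probability of the correct class.
e = np.exp(scores - scores.max())
softmax_loss = -np.log(e[y] / e.sum())
```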
Activation (nonlinearity) functions
• The sigmoid logistic function s(x) = 1/(1 + e^(−x)) is normalized to between 0 and 1, so it is naturally probability-like,
– so, naturally, sigmoid for two classes,
– and softmax for N classes, e^(x_i) / ∑_j e^(x_j)
– This is a normalization of the output data; also remember the similar consideration for input data normalization (whitening)
– The activation (nonlinearity) function is not necessarily the logistic sigmoid between 0 and 1; others include tanh (centered), ReLU, …
• Sigmoid kills gradients, so it is rarely used any more
• Tanh, 2 s(2x) − 1, centered between −1 and 1, is better
• ReLU, max(0, x)
– Very popular recently
– Don't set the learning rate too high
• Practice:
– Rarely mix different activation types in the same network
– Use ReLU
– Networks now typically have around 100 million parameters and 10 to 20 layers
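For reference, the activation functions mentioned above, written as small NumPy functions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes to (0, 1); saturates and kills gradients

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0  # centered in (-1, 1); equal to np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)            # max(0, x); the common default choice
```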
From linear to non-linear classifiers
• Go to higher dimensions, and stay linear!
• Find a map or transform x ↦ φ(x) that makes the classes linearly separable, but in higher dimensions (see the small XOR sketch below)
• A complete basis of polynomials → too many parameters for the limited training data
• Kernel methods, support vector machines, …
• Learn the nonlinearity at the same time as the linear classifiers → multilayer neural networks
• Multilayer neural networks
• They implement linear classifiers, but in a space where the inputs have been mapped nonlinearly!
• A universal nonlinear approximator, from at least three layers (two hidden layers) onwards
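A tiny example of the "go higher and stay linear" idea: XOR-labeled points are not linearly separable in 2D, but after the polynomial map x ↦ φ(x) = (x1, x2, x1·x2) a single hyper-plane separates them; the map and the hand-picked weights below are one illustrative choice.

```python
import numpy as np

# XOR: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def phi(x):
    # Polynomial feature map to 3-D: (x1, x2, x1*x2).
    return np.array([x[0], x[1], x[0] * x[1]])

w = np.array([1.0, 1.0, -2.0])   # a hyper-plane in the lifted space (hand-picked here)
b = -0.5

preds = np.array([1 if w @ phi(x) + b > 0 else 0 for x in X])
assert (preds == y).all()        # linear in phi-space, nonlinear in the original space
```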
Multi-Layer Perceptrons
• An N-layer neural network does not count the input layer
• But it does count the output layer: the output layer represents the class-score vector; it has no activation function (or, equivalently, the identity activation function)
• Activation is a kind of data normalization
• It is better to count the hidden layers:
• A one-layer network, f1(x) (followed by s1): linear classifiers, no hidden layer
• A two-layer network, f2 ∘ s1 ∘ f1 (x): one hidden layer
• A three-layer network, f3 ∘ s2 ∘ f2 ∘ s1 ∘ f1 (x): two hidden layers
For a model f(x): forward inference evaluates f(x), and backward learning computes the gradient ∇f(x)
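A compact sketch of this layer-counting convention: hidden layers apply an affine map followed by an activation s, and the output layer returns raw class scores with no activation; the layer sizes and tanh activation below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]             # input, two hidden layers, output (arbitrary sizes)
Ws = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[1:], sizes[:-1])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    # Three-layer network: f3 ∘ s2 ∘ f2 ∘ s1 ∘ f1, with tanh as the activation s.
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.tanh(W @ h + b)   # hidden layers: affine map followed by activation
    return Ws[-1] @ h + bs[-1]   # output layer: class scores, no activation

scores = forward(np.array([1.0, -2.0, 0.5]))
```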
A 2-layer Neural Network, one hidden
layer of 4 neurons (or units), and one
output layer with 2 neurons, and three
inputs.
The network has 4 + 2 = 6 neurons (not
counting the inputs), [3 x 4] + [4 x 2] = 20
weights and 4 + 2 = 6 biases, for a total of
26 learnable parameters.
A 3-layer neural network with three inputs,
two hidden layers of 4 neurons each and
one output layer. Notice that in both cases
there are connections (synapses) between
neurons across layers, but not within a
layer.
The network has 4 + 4 + 1 = 9 neurons, [3
x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32
weights and 4 + 4 + 1 = 9 biases, for a total
of 41 learnable parameters.
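The parameter counts quoted above can be checked mechanically; the helper below simply sums the weight-matrix sizes and the per-neuron biases.

```python
def count_params(sizes):
    """Weights plus biases for a fully connected net with the given layer sizes."""
    weights = sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])   # one bias per neuron, inputs excluded
    return weights + biases

assert count_params([3, 4, 2]) == 26      # the 2-layer example: 12 + 8 weights, 6 biases
assert count_params([3, 4, 4, 1]) == 41   # the 3-layer example: 12 + 16 + 4 weights, 9 biases
```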
From a regular network to CNN: a visual
machine
• The whole network governed by a differentiable loss
function: from the raw pixels to class scores
• Each layer transforms an input to an output with
some differentiable function
• Full connectivity
– It does not scale to larger images and deeper layers
– It quickly leads to over-fitting
A regular 3-layer Neural Network.
A CNN arranges its neurons in three dimensions (width, height, depth).
Every layer of a CNN transforms the 3D input volume to a 3D output volume.
In this example, the red input layer holds the image, so its width and height
would be the dimensions of the image, and the depth would be 3 (Red,
Green, Blue channels).
• We used to convert an input image into a 1D feature vector
– That was feature extraction/selection
• We now input the image directly, as a 2D array
• The neurons are arranged from 1D to 2D, and to 3D
• Converting input images into feature vectors loses the spatial neighborhood structure
• The complexity increases cubically
• Yet the connectivity becomes local, to reduce the complexity!
CNN
• INPUT [32x32x3] will hold the raw pixel values of the image, in this case an
image of width 32, height 32, and with three channels R,G,B.
• CONV layer will compute the output of neurons that are connected to
local regions in the input, each computing a dot product between their
weights and a small region they are connected to in the input volume. This
may result in volume such as [32x32x12] if we decided to use 12 filters.
• RELU layer will apply an elementwise activation function, max(0,x). This
leaves the size of the volume unchanged ([32x32x12]).
• POOL layer will perform a down-sampling along the spatial dimensions
(width, height), resulting in volume such as [16x16x12].
• FC (i.e. fully-connected) layer will compute the class scores, resulting in
volume of size [1x1x10], where each of the 10 numbers correspond to a
class score, such as among the 10 categories of CIFAR-10. Each neuron in
this layer will be connected to all the numbers in the previous volume.
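A shape-only walkthrough of this CIFAR-10 example; the helpers below track volume sizes only (no arithmetic on pixels), and the 3x3 filter, stride 1, and padding 1 are assumed values chosen so that CONV preserves the 32x32 spatial size.

```python
def conv_shape(h, w, d, num_filters, field=3, stride=1, pad=1):
    # Standard output-size formula: (W - F + 2P) / S + 1 along each spatial dimension.
    h_out = (h - field + 2 * pad) // stride + 1
    w_out = (w - field + 2 * pad) // stride + 1
    return h_out, w_out, num_filters

def pool_shape(h, w, d, field=2, stride=2):
    return (h - field) // stride + 1, (w - field) // stride + 1, d

shape = (32, 32, 3)                          # INPUT: raw CIFAR-10 image
shape = conv_shape(*shape, num_filters=12)   # CONV with 12 filters -> (32, 32, 12)
# RELU is elementwise, so the volume stays (32, 32, 12)
shape = pool_shape(*shape)                   # POOL 2x2, stride 2 -> (16, 16, 12)
fc_out = (1, 1, 10)                          # FC: one score per CIFAR-10 class
```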
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
where the * indicates repetition, and the POOL? indicates an
optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >=
0, K >= 0 (and usually K < 3). For example, here are some common
ConvNet architectures you may see that follow this pattern:
The initial volume stores the raw image pixels (left) and the last volume
stores the class scores (right). Each volume of activations along the
processing path is shown as a column. Since it's difficult to visualize 3D
volumes, we lay out each volume's slices in rows. The last layer
volume holds the scores for each class, but here we only visualize the
sorted top 5 scores, and print the labels of each one. The full web-based demo is shown in the header of the cs231n website. The architecture
shown here is a tiny VGG Net.
CNN layers
• Some layers do not have parameters: the RELU and POOL layers implement a fixed function
• Some layers contain parameters, the CONV and FC
layers
The Convolutional Layer
• Local connectivity.
– The receptive field of the neuron, or the filter size.
– The connections are local in space (width and height), but
always full in depth.
• A set of learnable filters
• Parameter sharing
The “convolution”
• For a 3D input volume, the convolution is 2D in each channel, and each channel has a different filter (kernel); the per-channel convolutions are then summed over all channels to produce a scalar for the nonlinear activation
– Do we need additional linear-combination parameters across the channels?
• A convolution can be defined in 1, 2, 3, and N dimensions
• The 2D convolution is different from a true 3D convolution, which integrates spatio-temporal information; the standard CNN convolution spreads only ‘spatially’
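A naive, loop-based sketch of what one CNN "convolution" does at a single spatial location: a 2D patch per channel, multiplied element-wise by the filter and summed over all channels (plus a bias) to give one scalar for the nonlinearity; stride and padding are omitted for brevity, and all values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))   # input volume: height x width x channels
w = rng.standard_normal((3, 3, 3))     # one filter: field x field x (full input depth)
b = 0.0

def conv_at(x, w, b, i, j):
    """Filter response at position (i, j): a 2D patch per channel, summed over all channels."""
    f = w.shape[0]
    patch = x[i:i + f, j:j + f, :]     # local receptive field, full in depth
    return np.sum(patch * w) + b       # one scalar, ready for the nonlinearity

out = np.array([[conv_at(x, w, b, i, j) for j in range(30)] for i in range(30)])
activated = np.maximum(0.0, out)       # ReLU on the resulting 30x30 activation map
```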
The Pooling Layer
• Reduce the spatial size
• Reduce the amount of parameters
• Avoid over-fitting
• Backpropagation for a max: only routing the gradient to the
input that had the highest value in the forward pass
• It is unclear whether the pooling is essential.
• Data normalization or PCA/whitening is common in general neural networks, but in CNNs the ‘normalization layer’ has also been shown to contribute minimally.
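A minimal 2x2, stride-2 max-pooling sketch; the `switches` dictionary records which input won each max, since that is the only place the gradient is routed in the backward pass.

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)   # one 4x4 activation map (toy values)

def max_pool_2x2(x):
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    switches = {}                                # remembers the argmax for backprop routing
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            out[i // 2, j // 2] = window.max()
            k = np.unravel_index(window.argmax(), window.shape)
            switches[(i // 2, j // 2)] = (i + k[0], j + k[1])   # gradient flows only here
    return out, switches

pooled, switches = max_pool_2x2(x)               # 4x4 -> 2x2; this layer has no parameters
```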
Computational complexity
• The memory bottleneck
• GPU, a few GB
CNN applications
• Transfer learning
• Fine-tuning the CNN
– Keep some early layers
• Early layers contain more generic features, edges, color blobs
• Common to many visual tasks
– Fine-tune the later layers
• More specific to the details of the class
• CNN as feature extractor
– Remove the last fully connected layer
– A kind of descriptor, or ‘CNN codes’, for the image
– AlexNet gives a 4096-dimensional descriptor
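A sketch of both uses, assuming PyTorch and torchvision are available (the pretrained-weights argument name varies across torchvision versions, and the 5-class new task is hypothetical): freeze the early, generic layers, replace the final classifier for fine-tuning, and keep the 4096-D activations as CNN codes.

```python
import torch
import torchvision

# Load a pretrained AlexNet (the weights argument differs across torchvision versions).
model = torchvision.models.alexnet(weights="DEFAULT")

# CNN as a feature extractor: drop the last fully connected layer, keep the 4096-D output.
feature_extractor = torch.nn.Sequential(
    model.features,                            # early conv layers: generic edges, color blobs
    model.avgpool,
    torch.nn.Flatten(),
    *list(model.classifier.children())[:-1],   # all FC layers except the final class scores
).eval()

# Fine-tuning: freeze the early, generic layers and retrain only the later, task-specific part.
for p in model.features.parameters():
    p.requires_grad = False
num_new_classes = 5                            # hypothetical new task
model.classifier[-1] = torch.nn.Linear(4096, num_new_classes)

x = torch.randn(1, 3, 224, 224)                # a dummy input image batch
with torch.no_grad():
    codes = feature_extractor(x)               # shape (1, 4096): the 'CNN codes' descriptor
```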
Open questions
• It is only empirical that deeper is better
• Images contain hierarchical structures
• Overfitting and generalization
• Meaningful data! Intrinsic laws
• Networks are non-convex
– They need regularization
• Smaller networks are harder to train with local methods
– Their local minima are bad (high loss), unstable, with large variance
• Bigger ones are easier
– They have more local minima, but these are better (lower loss), more stable, with small variance
• Go as big as the computational power, and the data, allow!