Optimization (supervised learning)

What is computer vision?
• The search for fundamental visual features, and the two fundamental applications: reconstruction and recognition
– Features
– Recognition
– Reconstruction
2012
• The current mania and euphoria of the AI revolution
– 2012, the annual ImageNet challenge: an improvement of about 10 points (from 75 to 85)
• Computer vision researchers use machine learning techniques to recognize objects in large amounts of images
– These techniques go back to 1998 (a decade and a half earlier!)
• Text (hand-written and printed) is actually visual!
– So why did it take so long?
• A silent hardware revolution: the GPU
• Sadly, driven by video gaming
– Nvidia (a GPU maker) is now in the driving seat of this AI revolution!
• 2016, AlphaGo beats professionals
• A narrow AI program
• Re-shaping AI, Computer Science, digital revolution …
Visual matching, and recognition for
understanding
• Finding the visually similar things in different images
--- Visual similarities
• Visual matching, find the ‘same’ thing under different
viewpoints, better defined, no semantics per se.
• Visual recognition, find the pre-trained ‘labels’,
semantics
– We define ‘labels’, then ‘learn’ from labeled data, and finally classify new inputs into these ‘labels’
The state-of-the-art of visual classification
and recognition
• Anything you can clearly define and label
• Then show a few thousand labeled examples of this thing to the computer
• The computer then recognizes a new, previously unseen image as well as humans, or even better!
• This is done by deep neural networks.
References
• CNN for Visual Recognition, Stanford, http://cs231n.github.io/neural-networks-1/
• Deep Learning Tutorial (LeNet), Montreal, http://www.deeplearning.net/tutorial/mlp.html
• Pattern Recognition and Machine Learning, Bishop
• Sparse and Redundant Representations, Elad
• Pattern Recognition and Neural Networks, Ripley
• Pattern Classification, Duda and Hart, various editions
• A Wavelet Tour of Signal Processing: The Sparse Way, Mallat
• Introduction to Applied Mathematics, Strang
Some figures and text in these slides are taken from these references.
Classification and recognition
• What is it (which class), for the input x?
– Make a decision, either by probability a > b, or by the classification surface f(x) > 0 or < 0
– Forward inference
• How to compute?
– Estimate the classification surface f(x) = 0: a (nonlinear and high-dimensional) optimization problem (often of a differentiable log-likelihood)
– Backward learning
• What to minimize?
– Justification, often probabilistic, and Bayesian
• A (parameterized) score function mapping the data
to class score, forward inference, modeling
• A loss function (objective) measuring the quality of a
particular set of parameters based on the ground
truth labels
• Optimization, minimize the loss over the parameters
with a regularization, backward learning
The dataset of pairs (x, y) is given and fixed. The weights start out as random numbers and can change. During the forward pass the score function computes class scores, stored in the vector f. The loss function contains two components: the data loss computes the compatibility between the scores f and the labels y, while the regularization loss is only a function of the weights. During gradient descent, we compute the gradient on the weights (and optionally on the data if we wish) and use it to perform a parameter update.
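The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the slides' exact setup: the shapes, the softmax data loss, the L2 regularization strength, and the step size are all assumed choices.

```python
import numpy as np

# Illustrative shapes and values (not from the slides): N examples, D features, C classes.
N, D, C = 100, 32, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))          # the fixed dataset of inputs x
y = rng.integers(0, C, size=N)           # the fixed ground-truth labels
W = 0.01 * rng.standard_normal((D, C))   # the weights start out as random numbers

def loss_and_grad(W, X, y, reg=1e-3):
    """Forward: class scores f and total loss (data + regularization); backward: gradient on W."""
    n = X.shape[0]
    f = X @ W                                              # score function: class scores f
    f = f - f.max(axis=1, keepdims=True)                   # shift for numerical stability
    p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)   # softmax probabilities
    data_loss = -np.log(p[np.arange(n), y]).mean()         # compatibility of scores f with labels y
    reg_loss = 0.5 * reg * np.sum(W * W)                   # a function of the weights only
    dp = p.copy()
    dp[np.arange(n), y] -= 1.0                             # gradient of the data loss w.r.t. f
    dW = X.T @ dp / n + reg * W                            # gradient on the weights
    return data_loss + reg_loss, dW

step_size = 1e-1
for _ in range(200):                      # gradient descent: repeated parameter updates
    loss, dW = loss_and_grad(W, X, y)
    W -= step_size * dW
```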
Bayesian decision
• P(ω_j | x) = P(x | ω_j) P(ω_j) / P(x)
• posterior = likelihood × prior / evidence
• Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2
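As a minimal sketch of this two-class decision rule, the snippet below assumes made-up 1-D Gaussian class-conditional likelihoods and priors; none of the numbers come from the slides.

```python
import numpy as np

priors = np.array([0.6, 0.4])   # P(w1), P(w2), assumed values
means = np.array([0.0, 2.0])    # class-conditional Gaussian means (assumed)
sigmas = np.array([1.0, 1.0])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def decide(x):
    likelihoods = gaussian_pdf(x, means, sigmas)   # P(x | w_j)
    evidence = np.sum(likelihoods * priors)        # P(x)
    posteriors = likelihoods * priors / evidence   # P(w_j | x)
    return np.argmax(posteriors) + 1               # decide w1 if P(w1|x) > P(w2|x), else w2

print(decide(0.3))   # -> 1 (closer to class 1's mean, and class 1 has the higher prior)
print(decide(1.8))   # -> 2
```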
Optimization (supervised learning)
• Minimize a loss function
• The number of errors: the zero-one loss
• The zero-one loss is not differentiable, so we maximize the log-likelihood, or equivalently minimize the negative log-likelihood
• We use the gradient of this function
• Stochastic gradient descent uses a few examples at a time instead of the entire training set
• The loss function should be regularized (against ill-posedness / non-unique solutions, as a smoothness constraint, or to avoid overfitting)
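Continuing the NumPy sketch from the forward/backward example above (the loss_and_grad helper there is an illustrative assumption, not a fixed API), stochastic gradient descent only changes which examples each update sees:

```python
# Stochastic gradient descent: each update uses a small random minibatch
# instead of the entire training set (batch size and step size are assumed values).
batch_size, step_size, num_steps = 16, 1e-1, 1000
for step in range(num_steps):
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    loss, dW = loss_and_grad(W, X[idx], y[idx], reg=1e-3)   # regularized minibatch loss
    W -= step_size * dW                                     # one parameter update per minibatch
```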
Optimization (supervised learning)
• Training, validation, and testing data
• Hyper-parameters
• Overfitting to the data
• Generalization
• Regularization
Fundamental linear classifiers
• Binary linear classifier, y = f(x) = w ⋅ x + b
• The classification surface is a hyper-plane, f(x) = 0
• Geometry: 3-D and n-D
• Linear algebra: linear spaces
• Linear classifiers
• The decision is a nonlinear thresholding
• A nonlinear distance function, or a probability-like sigmoid
• A single neuron is a linear classifier
• w ⋅ x + b: a linear classifier, a neuron
– It is a dot product of two vectors, a scalar product
– A template matching, a correlation, between the template w and the input vector x
– Also an algebraic distance, not the geometric one, which is nonlinear (therefore the solution is usually nonlinear!)
• The dot product acts as a similarity measure between two points: one is the data, the other a representative (the template)
– ‘Linear’ means that the decision surface is linear, a hyper-plane. The solution, i.e. the training, is usually not linear at all; it depends on the loss function (softmax or SVM) and is solved iteratively by numerical gradient descent
A biological neuron and its mathematical model.
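A minimal sketch of a single neuron as a linear classifier: a dot product w ⋅ x + b (a template match / correlation), followed by a sigmoid that makes the score probability-like; the weights and input below are made-up numbers.

```python
import numpy as np

w = np.array([0.5, -1.2, 0.3])   # the template: the weights of one neuron (assumed values)
b = 0.1
x = np.array([1.0, 0.2, -0.7])   # one input vector

score = np.dot(w, x) + b              # linear score: a dot product / correlation with the template
prob = 1.0 / (1.0 + np.exp(-score))   # sigmoid turns the score into a probability-like value
decision = 1 if score > 0 else 0      # the decision itself is a nonlinear threshold, f(x) > 0
```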
From two to N classes
• Binary linear classifier, y = f(x) = w ⋅ x + b
• The classification surface is a hyper-plane, f(x) = 0
• Multi-class: output a vector function, y = f(x) = f(W x + b)
• The normalized exponentials (softmax), s(f(x)) = (s ∘ f)(x)
• (s is a kind of normalization)
• W x + b: each row of W is a linear classifier, a neuron
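A small sketch of the multi-class case: each row of W is one linear classifier, and the softmax s normalizes the score vector into class probabilities. The shapes (10 classes, a 3072-dimensional flattened image) are assumptions for illustration.

```python
import numpy as np

def softmax(scores):
    # Normalized exponentials; subtracting the max is a standard numerical-stability trick.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

W = np.random.randn(10, 3072) * 0.01   # 10 classes, e.g. a 32*32*3 = 3072-pixel input
b = np.zeros(10)
x = np.random.randn(3072)              # one flattened input image

scores = W @ x + b                     # f(x) = W x + b: one score per class, one neuron per row
probs = softmax(scores)                # s(f(x)): class probabilities summing to 1
```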
Is a linear classifier straightforward?
• Only the inference ‘scoring’ function is linear
• There are no ‘analytical’ (closed-form) solutions of the loss functions
• The conditions are inequalities, not equalities
The two common linear classifiers, with
different loss functions
• SVM, uncalibrated score
• Softmax, multi-class logistic regression, a normalized
class probability for each label
• They are usually comparable
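To make the two loss functions concrete, here is a per-example sketch of the multi-class SVM (hinge) loss and the softmax (cross-entropy) loss on the same score vector; the margin of 1.0 follows the usual convention and the scores are made up.

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # class scores f(x) for one example (made-up numbers)
y = 0                                  # index of the correct class

# Multi-class SVM loss: sum of the margins by which wrong classes beat the correct one + 1.
margins = np.maximum(0, scores - scores[y] + 1.0)
margins[y] = 0
svm_loss = margins.sum()               # 2.9 for these numbers

# Softmax (cross-entropy) loss: negative log of the normalized probability of the correct class.
e = np.exp(scores - scores.max())
softmax_loss = -np.log(e[y] / e.sum())
```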
Activation (nonlinearity) functions
• The sigmoid logistic function s(x) = 1/(1 + e^(−x)) is normalized to between 0 and 1, so it is naturally probability-like,
– so, naturally, sigmoid for two classes,
– and softmax for N classes, e^(x_i) / ∑_j e^(x_j)
– This is a normalization of the output data; also remember the similar consideration for input data normalization (whitening)
– The activation (nonlinearity) function is not necessarily the logistic sigmoid between 0 and 1; others include tanh (centered), ReLU, …
• Sigmoid kills gradients, so it is rarely used any more
• Tanh, 2 s(2x) − 1, centered between −1 and 1, is better
• ReLU, max(0, x)
– Very popular recently
– Don't set the learning rate too high
• Practice:
– Rarely mix different activation types in the same network
– Use ReLU
– Networks now typically have around 100 million parameters and 10 to 20 layers
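For reference, the activation functions mentioned above, written as small NumPy functions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes to (0, 1); saturates and kills gradients

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0  # centered in (-1, 1); equal to np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)            # max(0, x); the common default choice
```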
From linear to non-linear classifiers
• Go to higher dimensions, and stay linear!
• Find a map or transform x ↦ φ(x) that makes the classes linearly separable, but in higher dimensions (see the small XOR sketch below)
• A complete basis of polynomials → too many parameters for the limited training data
• Kernel methods, support vector machines, …
• Learn the nonlinearity at the same time as the linear classifiers → multilayer neural networks
• Multilayer neural networks
• They implement linear classifiers, but in a space where the inputs have been mapped nonlinearly!
• A universal nonlinear approximator, from at least three layers (two hidden layers) onwards
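A tiny example of the "go higher and stay linear" idea: XOR-labeled points are not linearly separable in 2D, but after the polynomial map x ↦ φ(x) = (x1, x2, x1·x2) a single hyper-plane separates them; the map and the hand-picked weights below are one illustrative choice.

```python
import numpy as np

# XOR: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def phi(x):
    # Polynomial feature map to 3-D: (x1, x2, x1*x2).
    return np.array([x[0], x[1], x[0] * x[1]])

w = np.array([1.0, 1.0, -2.0])   # a hyper-plane in the lifted space (hand-picked here)
b = -0.5

preds = np.array([1 if w @ phi(x) + b > 0 else 0 for x in X])
assert (preds == y).all()        # linear in phi-space, nonlinear in the original space
```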
Multi-Layer Perceptrons
• An N-layer neural network does not count the input layer
• But it does count the output layer: the output layer represents the class-score vector; it has no activation function (or, equivalently, the identity activation function)
• Activation is a kind of data normalization
• It is better to count the hidden layers:
• A one-layer network, f1(x) (followed by s1): linear classifiers, no hidden layer
• A two-layer network, f2 ∘ s1 ∘ f1 (x): one hidden layer
• A three-layer network, f3 ∘ s2 ∘ f2 ∘ s1 ∘ f1 (x): two hidden layers
For a model f(x): forward inference evaluates f(x), and backward learning computes the gradient ∇f(x)
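A compact sketch of this layer-counting convention: hidden layers apply an affine map followed by an activation s, and the output layer returns raw class scores with no activation; the layer sizes and tanh activation below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]             # input, two hidden layers, output (arbitrary sizes)
Ws = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[1:], sizes[:-1])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    # Three-layer network: f3 ∘ s2 ∘ f2 ∘ s1 ∘ f1, with tanh as the activation s.
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.tanh(W @ h + b)   # hidden layers: affine map followed by activation
    return Ws[-1] @ h + bs[-1]   # output layer: class scores, no activation

scores = forward(np.array([1.0, -2.0, 0.5]))
```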
A 2-layer Neural Network, one hidden
layer of 4 neurons (or units), and one
output layer with 2 neurons, and three
inputs.
The network has 4 + 2 = 6 neurons (not
counting the inputs), [3 x 4] + [4 x 2] = 20
weights and 4 + 2 = 6 biases, for a total of
26 learnable parameters.
A 3-layer neural network with three inputs,
two hidden layers of 4 neurons each and
one output layer. Notice that in both cases
there are connections (synapses) between
neurons across layers, but not within a
layer.
The network has 4 + 4 + 1 = 9 neurons, [3
x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32
weights and 4 + 4 + 1 = 9 biases, for a total
of 41 learnable parameters.
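The parameter counts quoted above can be checked mechanically; the helper below simply sums the weight-matrix sizes and the per-neuron biases.

```python
def count_params(sizes):
    """Weights plus biases for a fully connected net with the given layer sizes."""
    weights = sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])   # one bias per neuron, inputs excluded
    return weights + biases

assert count_params([3, 4, 2]) == 26      # the 2-layer example: 12 + 8 weights, 6 biases
assert count_params([3, 4, 4, 1]) == 41   # the 3-layer example: 12 + 16 + 4 weights, 9 biases
```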
From a regular network to CNN: a visual
machine
• The whole network governed by a differentiable loss
function: from the raw pixels to class scores
• Each layer transforms an input to an output with
some differentiable function
• Full connectivity
– It does not scale to larger images and deeper layers
– It quickly leads to over-fitting
A regular 3-layer Neural Network.
A CNN arranges its neurons in three dimensions (width, height, depth).
Every layer of a CNN transforms the 3D input volume to a 3D output volume.
In this example, the red input layer holds the image, so its width and height
would be the dimensions of the image, and the depth would be 3 (Red,
Green, Blue channels).
• We used to convert an input image into a 1D feature vector
– That was feature extraction/selection
• We now input the image directly, as a 2D array
• The neurons are arranged from 1D to 2D, and to 3D
• Converting input images into feature vectors loses the spatial neighborhood structure
• The complexity increases cubically
• Yet the connectivity becomes local, to reduce the complexity!
CNN
• INPUT [32x32x3] will hold the raw pixel values of the image, in this case an
image of width 32, height 32, and with three channels R,G,B.
• CONV layer will compute the output of neurons that are connected to
local regions in the input, each computing a dot product between their
weights and a small region they are connected to in the input volume. This
may result in volume such as [32x32x12] if we decided to use 12 filters.
• RELU layer will apply an elementwise activation function, max(0,x). This
leaves the size of the volume unchanged ([32x32x12]).
• POOL layer will perform a down-sampling along the spatial dimensions
(width, height), resulting in volume such as [16x16x12].
• FC (i.e. fully-connected) layer will compute the class scores, resulting in
volume of size [1x1x10], where each of the 10 numbers correspond to a
class score, such as among the 10 categories of CIFAR-10. Each neuron in
this layer will be connected to all the numbers in the previous volume.
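A shape-only walkthrough of this CIFAR-10 example; the helpers below track volume sizes only (no arithmetic on pixels), and the 3x3 filter, stride 1, and padding 1 are assumed values chosen so that CONV preserves the 32x32 spatial size.

```python
def conv_shape(h, w, d, num_filters, field=3, stride=1, pad=1):
    # Standard output-size formula: (W - F + 2P) / S + 1 along each spatial dimension.
    h_out = (h - field + 2 * pad) // stride + 1
    w_out = (w - field + 2 * pad) // stride + 1
    return h_out, w_out, num_filters

def pool_shape(h, w, d, field=2, stride=2):
    return (h - field) // stride + 1, (w - field) // stride + 1, d

shape = (32, 32, 3)                          # INPUT: raw CIFAR-10 image
shape = conv_shape(*shape, num_filters=12)   # CONV with 12 filters -> (32, 32, 12)
# RELU is elementwise, so the volume stays (32, 32, 12)
shape = pool_shape(*shape)                   # POOL 2x2, stride 2 -> (16, 16, 12)
fc_out = (1, 1, 10)                          # FC: one score per CIFAR-10 class
```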
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
where the * indicates repetition, and the POOL? indicates an
optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >=
0, K >= 0 (and usually K < 3). For example, here are some common
ConvNet architectures you may see that follow this pattern:
The initial volume stores the raw image pixels (left) and the last volume
stores the class scores (right). Each volume of activations along the
processing path is shown as a column. Since it's difficult to visualize 3D
volumes, we lay out each volume's slices in rows. The last layer
volume holds the scores for each class, but here we only visualize the
sorted top 5 scores, and print the labels of each one. The full web-based demo is shown in the header of the cs231n website. The architecture
shown here is a tiny VGG Net.
CNN layers
• Some layers do not have parameters: the RELU and POOL layers implement a fixed function
• Some layers contain parameters, the CONV and FC
layers
The Convolutional Layer
• Local connectivity.
– The receptive field of the neuron, or the filter size.
– The connections are local in space (width and height), but
always full in depth.
• A set of learnable filters
• Parameter sharing
The “convolution”
• For a 3D input volume, the convolution is 2D in each channel, and each channel has a different filter (kernel); the per-channel convolutions are then summed over all channels to produce a scalar for the nonlinear activation
– Do we need additional linear-combination parameters across the channels?
• A convolution can be defined in 1, 2, 3, and N dimensions
• The 2D convolution is different from a true 3D convolution, which integrates spatio-temporal information; the standard CNN convolution spreads only ‘spatially’
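A naive, loop-based sketch of what one CNN "convolution" does at a single spatial location: a 2D patch per channel, multiplied element-wise by the filter and summed over all channels (plus a bias) to give one scalar for the nonlinearity; stride and padding are omitted for brevity, and all values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))   # input volume: height x width x channels
w = rng.standard_normal((3, 3, 3))     # one filter: field x field x (full input depth)
b = 0.0

def conv_at(x, w, b, i, j):
    """Filter response at position (i, j): a 2D patch per channel, summed over all channels."""
    f = w.shape[0]
    patch = x[i:i + f, j:j + f, :]     # local receptive field, full in depth
    return np.sum(patch * w) + b       # one scalar, ready for the nonlinearity

out = np.array([[conv_at(x, w, b, i, j) for j in range(30)] for i in range(30)])
activated = np.maximum(0.0, out)       # ReLU on the resulting 30x30 activation map
```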
The Pooling Layer
• Reduce the spatial size
• Reduce the amount of parameters
• Avoid over-fitting
• Backpropagation for a max: only routing the gradient to the
input that had the highest value in the forward pass
• It is unclear whether the pooling is essential.
• Data normalization or PCA/whitening is common in general neural networks, but in CNNs the ‘normalization layer’ has also been shown to contribute minimally.
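A minimal 2x2, stride-2 max-pooling sketch; the `switches` dictionary records which input won each max, since that is the only place the gradient is routed in the backward pass.

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)   # one 4x4 activation map (toy values)

def max_pool_2x2(x):
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    switches = {}                                # remembers the argmax for backprop routing
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            out[i // 2, j // 2] = window.max()
            k = np.unravel_index(window.argmax(), window.shape)
            switches[(i // 2, j // 2)] = (i + k[0], j + k[1])   # gradient flows only here
    return out, switches

pooled, switches = max_pool_2x2(x)               # 4x4 -> 2x2; this layer has no parameters
```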
Computational complexity
• The memory bottleneck
• GPU, a few GB
CNN applications
• Transfer learning
• Fine-tuning the CNN
– Keep some early layers
• Early layers contain more generic features, edges, color blobs
• Common to many visual tasks
– Fine-tune the later layers
• More specific to the details of the class
• CNN as feature extractor
– Remove the last fully connected layer
– A kind of descriptor, or ‘CNN codes’, for the image
– AlexNet gives a 4096-dimensional descriptor
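A sketch of both uses, assuming PyTorch and torchvision are available (the pretrained-weights argument name varies across torchvision versions, and the 5-class new task is hypothetical): freeze the early, generic layers, replace the final classifier for fine-tuning, and keep the 4096-D activations as CNN codes.

```python
import torch
import torchvision

# Load a pretrained AlexNet (the weights argument differs across torchvision versions).
model = torchvision.models.alexnet(weights="DEFAULT")

# CNN as a feature extractor: drop the last fully connected layer, keep the 4096-D output.
feature_extractor = torch.nn.Sequential(
    model.features,                            # early conv layers: generic edges, color blobs
    model.avgpool,
    torch.nn.Flatten(),
    *list(model.classifier.children())[:-1],   # all FC layers except the final class scores
).eval()

# Fine-tuning: freeze the early, generic layers and retrain only the later, task-specific part.
for p in model.features.parameters():
    p.requires_grad = False
num_new_classes = 5                            # hypothetical new task
model.classifier[-1] = torch.nn.Linear(4096, num_new_classes)

x = torch.randn(1, 3, 224, 224)                # a dummy input image batch
with torch.no_grad():
    codes = feature_extractor(x)               # shape (1, 4096): the 'CNN codes' descriptor
```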
Open questions
• It is only empirical that deeper is better
• Images contain hierarchical structures
• Overfitting and generalization
• Meaningful data! Intrinsic laws
• Networks are non-convex
– They need regularization
• Smaller networks are harder to train with local methods
– Their local minima are bad (high loss), unstable, with large variance
• Bigger ones are easier
– They have more local minima, but these are better (lower loss), more stable, with small variance
• Go as big as the computational power, and the data, allow!