Deep Learning Basics
Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang
Motivation I: representation learning
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$
• Optimization: minimize the empirical loss
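As a concrete illustration of this three-step recipe (a minimal sketch, not from the slides: the synthetic data, linear hypothesis class, squared loss, and step size are all assumptions):

```python
import numpy as np

# 1. Collect data and extract features (synthetic data for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# 2. Build model: linear hypothesis class y_hat = Xw with squared loss
def empirical_loss(w):
    return np.mean((X @ w - y) ** 2)

# 3. Optimization: minimize the empirical loss by gradient descent
w = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)           # gradient of the mean squared loss
    w -= 0.1 * grad

print(empirical_loss(w))                            # should approach the noise level
```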
Features
[Figure: an input image $x$ → extract features (a color histogram over the red, green, and blue channels) → build hypothesis $y = w^T \phi(x)$]
Features: part of the model
[Figure: the nonlinear model consists of the feature map together with the linear model $y = w^T \phi(x)$ built on top of it (build hypothesis)]
Example: Polynomial kernel SVM
$y = \mathrm{sign}(w^T \phi(x) + b)$
[Figure: inputs $x_1, x_2$ passed through a fixed feature map $\phi(x)$ to produce $y$]
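To make the fixed feature map concrete, here is a minimal sketch (the degree-2 polynomial features and the hand-picked weights are illustrative assumptions, not the slide's example):

```python
import numpy as np

def phi(x):
    """Fixed degree-2 polynomial feature map for a 2-d input x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def predict(w, b, x):
    # y = sign(w^T phi(x) + b); phi is fixed, only w and b are learned
    return np.sign(w @ phi(x) + b)

# Hand-picked weights for illustration (an SVM would learn these from data)
w = np.array([0.0, 0.0, 0.0, 1.0, 1.0])     # weight on x1^2 and x2^2
b = -1.0                                    # decision boundary x1^2 + x2^2 = 1
print(predict(w, b, np.array([0.2, 0.3])))  # inside the circle  -> -1.0
print(predict(w, b, np.array([1.5, 0.5])))  # outside the circle -> +1.0
```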
Motivation: representation learning
• Why don't we also learn $\phi(x)$?
[Figure: $x$ → learn $\phi(x)$ → learn $w$ → $y = w^T \phi(x)$]
Feedforward networks
• View each dimension of $\phi(x)$ as something to be learned
[Figure: network with input $x$, learned features $\phi(x)$, and output $y = w^T \phi(x)$]
Feedforward networks
• Linear functions $\phi_i(x) = \theta_i^T x$ don't work: need some nonlinearity
Feedforward networks
• Typically, set $\phi_i(x) = r(\theta_i^T x)$ where $r(\cdot)$ is some nonlinear function
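A minimal sketch of this construction (the sizes, random weights, and the choice of tanh for $r$ are assumptions for illustration): each learned feature is $\phi_i(x) = r(\theta_i^T x)$, and the output is linear in the learned features.

```python
import numpy as np

def feedforward(x, Theta, w, r=np.tanh):
    """One-hidden-layer network: phi_i(x) = r(theta_i^T x), y = w^T phi(x)."""
    phi = r(Theta @ x)      # row i of Theta is theta_i; both Theta and w are learned
    return w @ phi          # linear readout on the learned features

rng = np.random.default_rng(0)
Theta = rng.normal(size=(4, 3))   # 4 learned features of a 3-d input
w = rng.normal(size=4)
x = rng.normal(size=3)
print(feedforward(x, Theta, w))
```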
Feedforward deep networks
• What if we go deeper?
[Figure: deep feedforward network with input $x$, hidden layers $h^1, h^2, \dots, h^L$, and output $y$. Figure from Deep Learning, by Goodfellow, Bengio, Courville. Dark boxes are things to be learned.]
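Going deeper just composes such layers. A hedged sketch of the forward pass through hidden layers $h^1, \dots, h^L$ (layer sizes, random weights, and tanh are illustrative assumptions):

```python
import numpy as np

def deep_forward(x, weights, biases, r=np.tanh):
    """Forward pass h^i = r(W_i h^{i-1} + b_i); returns the top hidden layer h^L."""
    h = x
    for W, b in zip(weights, biases):
        h = r(W @ h + b)              # each hidden layer transforms the previous one
    return h

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 2]                  # input dim 3, then three layers
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(deep_forward(rng.normal(size=3), weights, biases))
```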
Motivation II: neurons
Motivation: neurons
[Figure of a biological neuron, from Wikipedia]
Motivation: abstract neuron model
• A neuron is activated when the correlation between the input and a pattern $\theta$ exceeds some threshold $b$
• $y = \mathrm{threshold}(\theta^T x - b)$ or $y = r(\theta^T x - b)$
• $r(\cdot)$ is called the activation function
[Figure: a neuron with inputs $x_1, x_2, \dots, x_n$ and output $y$]
Motivation: artificial neural networks
• Put into layers: feedforward deep networks
[Figure: feedforward deep network with input $x$, hidden layers $h^1, h^2, \dots, h^L$, and output $y$]
Components in Feedforward networks
Components
• Representations:
  • Input
  • Hidden variables
• Layers/weights:
  • Hidden layers
  • Output layer
Components
[Figure: input $x$, hidden variables $h^1, h^2, \dots, h^L$, with the first layer, hidden layers, and the output layer producing $y$]
Input
• Represented as a vector
• Sometimes requires some preprocessing, e.g. (see the sketch below):
  • Subtract mean
  • Normalize to [-1, 1]
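A small sketch of this preprocessing (the toy data matrix and the per-feature scaling rule are assumptions): subtract the mean of each feature, then scale it into [-1, 1].

```python
import numpy as np

def preprocess(X):
    """Center each feature (subtract mean) and scale it into [-1, 1]. One example per row."""
    X = X - X.mean(axis=0)                          # subtract mean
    max_abs = np.abs(X).max(axis=0)
    return X / np.where(max_abs > 0, max_abs, 1.0)  # normalize to [-1, 1]

X = np.array([[1.0, 200.0],
              [2.0, 100.0],
              [3.0, 300.0]])
print(preprocess(X))
```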
Output layers
• Regression: $y = w^T h + b$
• Linear units: no nonlinearity
[Figure: output layer mapping the top hidden layer $h$ to the output $y$]
Output layers
• Multi-dimensional regression: $y = W^T h + b$
• Linear units: no nonlinearity
[Figure: output layer mapping the top hidden layer $h$ to the output vector $y$]
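A hedged sketch of these two linear output layers (shapes and values are illustrative): a scalar regression output $y = w^T h + b$ and a multi-dimensional one $y = W^T h + b$, both with no nonlinearity.

```python
import numpy as np

def regression_output(h, w, b):
    """Scalar regression output: y = w^T h + b (linear unit, no nonlinearity)."""
    return w @ h + b

def multi_regression_output(h, W, b):
    """Multi-dimensional regression output: y = W^T h + b."""
    return W.T @ h + b

h = np.array([0.5, -1.0, 2.0])                                   # top hidden layer
print(regression_output(h, np.array([1.0, 0.0, 0.5]), 0.1))      # scalar y
print(multi_regression_output(h, np.ones((3, 2)), np.zeros(2)))  # 2-dimensional y
```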
Output layers
• Binary classification: $y = \sigma(w^T h + b)$
• Corresponds to using logistic regression on $h$
[Figure: output layer mapping the top hidden layer $h$ to the output $y$]
Output layers
• Multi-class classification:
  • $y = \mathrm{softmax}(z)$ where $z = W^T h + b$
  • Corresponds to using multi-class logistic regression on $h$
[Figure: output layer mapping the top hidden layer $h$ to $z$ and then $y$]
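And a matching sketch of the two classification output layers (shapes and values are illustrative): binary $y = \sigma(w^T h + b)$ and multi-class $y = \mathrm{softmax}(z)$ with $z = W^T h + b$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

def binary_output(h, w, b):
    """Binary classification: y = sigmoid(w^T h + b), i.e. logistic regression on h."""
    return sigmoid(w @ h + b)

def multiclass_output(h, W, b):
    """Multi-class classification: y = softmax(z) with z = W^T h + b."""
    return softmax(W.T @ h + b)

h = np.array([0.5, -1.0, 2.0])
print(binary_output(h, np.array([1.0, 0.0, 0.5]), 0.0))      # probability of class 1
print(multiclass_output(h, np.ones((3, 4)), np.zeros(4)))    # distribution over 4 classes
```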
Hidden layers
• Each neuron takes a weighted linear combination of the previous layer $h^i$
• So it can be thought of as outputting one value for the next layer $h^{i+1}$
[Figure: a neuron connecting layer $h^i$ to layer $h^{i+1}$]
Hidden layers
• $y = r(w^T x + b)$
• Typical activation functions $r$ (see the sketch below):
  • Threshold: $t(z) = \mathbf{1}[z \ge 0]$
  • Sigmoid: $\sigma(z) = 1/(1 + \exp(-z))$
  • Tanh: $\tanh(z) = 2\sigma(2z) - 1$
[Figure: a unit applying $r(\cdot)$ to the input $x$ to produce $y$]
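The three activation functions listed above, as a short sketch (the figure itself is omitted; the last line checks the tanh identity numerically):

```python
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)           # t(z) = 1[z >= 0]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # sigma(z) = 1 / (1 + exp(-z))

def tanh_via_sigmoid(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0     # tanh(z) = 2 sigma(2z) - 1

z = np.linspace(-3, 3, 7)
print(threshold(z))
print(sigmoid(z))
print(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))   # True: the identity holds
```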
Hidden layers
• Problem: saturation
[Figure: a sigmoid-like activation $r(\cdot)$ is nearly flat for large $|x|$, giving too small a gradient. Figure borrowed from Pattern Recognition and Machine Learning, Bishop]
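Saturation can be seen numerically: the sigmoid's derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which shrinks toward 0 as $|z|$ grows (a small illustrative check, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # derivative of the sigmoid

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))       # 0.25, ~0.10, ~0.0066, ~0.000045
```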
Hidden layers
• Activation function ReLU (rectified linear unit)
• $\mathrm{ReLU}(z) = \max\{z, 0\}$
[Figure of the ReLU function, from Deep Learning, by Goodfellow, Bengio, Courville]
Hidden layers
• Activation function ReLU (rectified linear unit)
• $\mathrm{ReLU}(z) = \max\{z, 0\}$
[Figure: ReLU has gradient 0 for $z < 0$ and gradient 1 for $z > 0$]
Hidden layers
• Generalizations of ReLU: $\mathrm{gReLU}(z) = \max\{z, 0\} + \alpha \min\{z, 0\}$
  • Leaky-ReLU: $\mathrm{LeakyReLU}(z) = \max\{z, 0\} + 0.01 \min\{z, 0\}$
  • Parametric-ReLU: $\alpha$ learnable
[Figure: plot of $\mathrm{gReLU}(z)$ against $z$]
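A sketch of ReLU and its generalizations as defined above (the test values are illustrative; making $\alpha$ learnable gives the Parametric-ReLU):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)                                # ReLU(z) = max{z, 0}

def g_relu(z, alpha):
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)   # generalized ReLU

def leaky_relu(z):
    return g_relu(z, 0.01)                                   # fixed alpha = 0.01

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))          # [0.  0.  0.  1.5]
print(leaky_relu(z))    # [-0.02  -0.005  0.  1.5]
# Parametric-ReLU: same form, but alpha is a learnable parameter
```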