Deep Learning Basics
Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang
Motivation I: representation learning
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class ℋ and loss function l
• Optimization: minimize the empirical loss
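The following is a minimal sketch of this 1-2-3 recipe (illustrative, not from the slides), assuming a linear hypothesis y = w^T φ(x) with a hand-coded feature map, squared loss, and made-up toy data.

```python
import numpy as np

# A minimal sketch of the 1-2-3 recipe (illustrative, not from the slides):
# 1) extract features phi(x), 2) pick a linear hypothesis y = w^T phi(x)
# with squared loss, 3) minimize the empirical loss by gradient descent.

def phi(x):
    """Hand-designed feature map: the raw input plus a constant bias feature."""
    return np.concatenate([x, [1.0]])

def empirical_loss(w, X, y):
    """Average squared loss over the training set."""
    preds = np.array([w @ phi(x) for x in X])
    return np.mean((preds - y) ** 2)

def fit(X, y, lr=0.1, steps=500):
    """Minimize the empirical loss with plain gradient descent."""
    w = np.zeros(X.shape[1] + 1)
    for _ in range(steps):
        grad = np.zeros_like(w)
        for x_i, y_i in zip(X, y):
            f = phi(x_i)
            grad += 2.0 * (w @ f - y_i) * f
        w -= lr * grad / len(X)
    return w

# Made-up toy data: y = 3*x0 - 2*x1 + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + 0.01 * rng.normal(size=100)
w = fit(X, y)
print(w, empirical_loss(w, X, y))
```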
Features
π‘₯
Color Histogram
Extract
features
build
hypothesis
Red
Green
Blue
𝑦 = π‘€π‘‡πœ™ π‘₯
Features: part of the model
[Figure: the feature map φ(x) is the nonlinear part of the model; the hypothesis y = w^T φ(x) built on top of it is the linear part.]
Example: Polynomial kernel SVM
π‘₯1
𝑦 = sign(𝑀 𝑇 πœ™(π‘₯) + 𝑏)
π‘₯2
Fixed πœ™ π‘₯
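A sketch of a fixed feature map in code (illustrative, not from the slides): an explicit degree-2 polynomial expansion of (x1, x2) with a linear classifier on top. The weights w and b below are made-up placeholders; in a kernel SVM they would come from training.

```python
import numpy as np

# A sketch of a fixed feature map (illustrative, not from the slides): an
# explicit degree-2 polynomial expansion of (x1, x2) with a linear classifier
# on top. The weights w, b are made-up placeholders; in a kernel SVM they
# would come from training.

def phi(x):
    """Fixed degree-2 polynomial features of a 2-d input x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

def predict(w, b, x):
    """y = sign(w^T phi(x) + b), with phi fixed rather than learned."""
    return np.sign(w @ phi(x) + b)

w = np.array([0.0, 0.0, 1.0, 1.0, 0.0])   # example weights: x1^2 + x2^2
b = -1.0                                   # threshold at unit radius
print(predict(w, b, np.array([0.2, 0.3])))  # inside the circle  -> -1.0
print(predict(w, b, np.array([1.5, 0.5])))  # outside the circle -> +1.0
```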
Motivation: representation learning
• Why don't we also learn φ(x)?
[Figure: input x feeds a learned feature map ("Learn φ(x)"), followed by learned output weights ("Learn w") producing y = w^T φ(x).]
Feedforward networks
• View each dimension of φ(x) as something to be learned
[Figure: network diagram with input x, learned features φ(x), and output y = w^T φ(x).]
Feedforward networks
• Linear functions φ_i(x) = θ_i^T x don't work: need some nonlinearity (otherwise y = w^T φ(x) is still just a linear function of x)
[Figure: network diagram with input x, features φ(x), and output y = w^T φ(x).]
Feedforward networks
• Typically, set φ_i(x) = r(θ_i^T x) where r(·) is some nonlinear function (code sketch below)
[Figure: network diagram with input x, learned features φ(x), and output y = w^T φ(x).]
Feedforward deep networks
• What if we go deeper?
[Figure: deep network with input x, hidden layers h^1, h^2, …, h^L, and output y. Dark boxes are things to be learned. Figure from Deep Learning by Goodfellow, Bengio, and Courville.]
Motivation II: neurons
Motivation: neurons
[Figure of a neuron, from Wikipedia.]
Motivation: abstract neuron model
• Neuron activated when the correlation between the input and a pattern θ exceeds some threshold b
• y = threshold(θ^T x − b) or y = r(θ^T x − b)
• r(·) is called the activation function (code sketch below)
[Figure: inputs x1, x2, …, xd feeding a single unit with output y.]
Motivation: artificial neural networks
• Put into layers: feedforward deep networks (code sketch below)
[Figure: deep network with input x, hidden layers h^1, h^2, …, h^L, and output y.]
Components in feedforward networks
Components
• Representations:
  • Input
  • Hidden variables
• Layers/weights:
  • Hidden layers
  • Output layer
Components
[Figure: input x, hidden variables h^1, h^2, …, h^L, with the first layer on the input side and the output layer producing y.]
Input
• Represented as a vector
• Sometimes requires some preprocessing, e.g.,
  • Subtract mean
  • Normalize to [−1, 1]
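A minimal sketch, not from the slides, of the two preprocessing steps just listed: subtract the per-feature mean, then rescale each feature into [−1, 1].

```python
import numpy as np

# A minimal sketch (not from the slides) of the two preprocessing steps just
# listed: subtract the per-feature mean, then rescale each feature to [-1, 1].

def preprocess(X):
    """X: (n_examples, n_features). Returns a centered, rescaled copy of X."""
    X = X - X.mean(axis=0)                            # subtract mean
    max_abs = np.abs(X).max(axis=0)
    return X / np.where(max_abs > 0, max_abs, 1.0)    # normalize to [-1, 1]

X = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 60.0]])
print(preprocess(X))
```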
Output layers
• Regression: y = w^T h + b
• Linear units: no nonlinearity
[Figure: output layer mapping h to y.]
Output layers
• Multi-dimensional regression: y = W^T h + b (code sketch below)
• Linear units: no nonlinearity
[Figure: output layer mapping h to y.]
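A minimal sketch, not from the slides, of linear output units on the last hidden layer h, covering both scalar regression y = w^T h + b and multi-dimensional regression y = W^T h + b; all values below are made up.

```python
import numpy as np

# A minimal sketch (not from the slides) of linear output units on the last
# hidden layer h: scalar regression y = w^T h + b and multi-dimensional
# regression y = W^T h + b. All values below are made up.

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # last hidden layer

w, b = rng.normal(size=8), 0.5
y_scalar = w @ h + b            # scalar regression output

W = rng.normal(size=(8, 3))     # 3 output dimensions
b_vec = rng.normal(size=3)
y_vector = W.T @ h + b_vec      # multi-dimensional regression output

print(y_scalar, y_vector)
```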
Output layers
• Binary classification: y = σ(w^T h + b) (code sketch below)
• Corresponds to using logistic regression on h
[Figure: output layer mapping h to y.]
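A minimal sketch, not from the slides, of the binary classification output y = σ(w^T h + b), read as the probability of the positive class, just as in logistic regression on the learned representation h.

```python
import numpy as np

# A minimal sketch (not from the slides) of the binary classification output
# y = sigma(w^T h + b), read as the probability of the positive class, just
# like logistic regression on the learned representation h.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # last hidden layer
w, b = rng.normal(size=8), 0.0  # made-up output weights

p = sigmoid(w @ h + b)          # probability of class 1
print(p, int(p >= 0.5))         # probability and a hard decision
```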
Output layers
• Multi-class classification (code sketch below):
  • y = softmax(z) where z = W^T h + b
  • Corresponds to using multi-class logistic regression on h
[Figure: output layer mapping h to z to y.]
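A minimal sketch, not from the slides, of the multi-class output layer: logits z = W^T h + b followed by softmax. Subtracting max(z) before exponentiating is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

# A minimal sketch (not from the slides) of the multi-class output layer:
# logits z = W^T h + b followed by softmax. Subtracting max(z) before
# exponentiating is a standard numerical-stability trick (it leaves the
# result unchanged).

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # last hidden layer
W = rng.normal(size=(8, 4))     # 4 classes
b = rng.normal(size=4)

z = W.T @ h + b                 # logits
y = softmax(z)                  # class probabilities, sum to 1
print(y, y.argmax())
```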
Hidden layers
• Each neuron takes a weighted linear combination of the previous layer
• So it can be thought of as outputting one value for the next layer
[Figure: connections from layer h^i to layer h^{i+1}.]
Hidden layers
• y = r(w^T x + b)
• Typical activation functions r (code sketch below):
  • Threshold: t(z) = 𝕀[z ≥ 0]
  • Sigmoid: σ(z) = 1/(1 + exp(−z))
  • Tanh: tanh(z) = 2σ(2z) − 1
[Figure: a unit computing y = r(w^T x + b) from input x.]
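A minimal sketch, not from the slides, of the three activations listed above, plus a numerical check of the stated identity tanh(z) = 2σ(2z) − 1.

```python
import numpy as np

# A minimal sketch (not from the slides) of the three activations listed
# above, plus a numerical check of the identity tanh(z) = 2*sigma(2z) - 1.

def threshold(z):
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(threshold(z))
print(sigmoid(z))
print(np.tanh(z))
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True
```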
Hidden layers
• Problem: saturation. For large |z| these activations flatten out, so the gradient becomes too small (code sketch below)
[Figure: a saturating activation r(·) with flat regions where the gradient is too small. Figure borrowed from Pattern Recognition and Machine Learning, Bishop.]
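A minimal sketch, not from the slides, showing saturation in numbers: the sigmoid's derivative σ'(z) = σ(z)(1 − σ(z)) peaks at 0.25 at z = 0 and shrinks toward 0 as |z| grows, which is the "too small gradient" problem.

```python
import numpy as np

# A minimal sketch (not from the slides) of saturation in numbers: the
# sigmoid's derivative sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at 0.25
# for z = 0 and shrinks toward 0 as |z| grows, so gradient signals vanish.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))   # 0.25, ~0.105, ~0.0066, ~0.000045
```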
Hidden layers
• Activation function ReLU (rectified linear unit)
• ReLU(z) = max{z, 0}
• Gradient is 0 for z < 0 and 1 for z > 0
[Figure: ReLU(z) plotted against z. Figure from Deep Learning by Goodfellow, Bengio, and Courville.]
Hidden layers
• Generalizations of ReLU: gReLU(z) = max{z, 0} + α min{z, 0} (code sketch below)
  • Leaky-ReLU(z) = max{z, 0} + 0.01 min{z, 0}
  • Parametric-ReLU(z): α learnable
[Figure: gReLU(z) plotted against z.]
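A minimal sketch, not from the slides, of the gReLU family defined above, with ReLU (α = 0), leaky ReLU (α = 0.01), and parametric ReLU (α treated as learnable) as special cases, plus the piecewise gradient.

```python
import numpy as np

# A minimal sketch (not from the slides) of the gReLU family defined above:
# gReLU(z) = max{z, 0} + alpha * min{z, 0}, with ReLU (alpha = 0),
# leaky ReLU (alpha = 0.01), and parametric ReLU (alpha learnable) as
# special cases.

def grelu(z, alpha=0.0):
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

def grelu_grad(z, alpha=0.0):
    """Gradient: 1 for z > 0, alpha for z < 0 (value at z = 0 set to 1 here)."""
    return np.where(z >= 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(grelu(z))                 # ReLU
print(grelu(z, alpha=0.01))     # leaky ReLU
print(grelu_grad(z, alpha=0.01))
```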