Deep Learning Basics
Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang
Motivation I: representation learning
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class ℋ and loss function ℓ
• Optimization: minimize the empirical loss
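As a minimal sketch of this 1-2-3 recipe, here is a hypothetical example assuming a linear hypothesis class and squared loss on synthetic data (the data, learning rate, and iteration count are illustrative, not from the lecture):

    import numpy as np

    # 1. Collect data and extract features (synthetic feature vectors, purely illustrative)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                       # rows are feature vectors phi(x)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    # 2. Build model: linear hypothesis y = w^T phi(x), squared loss
    w = np.zeros(3)

    # 3. Optimization: minimize the empirical loss by gradient descent
    for _ in range(500):
        grad = 2 * X.T @ (X @ w - y) / len(y)           # gradient of mean squared loss
        w -= 0.1 * grad
    print(w)                                            # approaches [1.0, -2.0, 0.5]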
Features
π‘₯
Color Histogram
Extract
features
build
hypothesis
Red
Green
Blue
𝑦 = π‘€π‘‡πœ™ π‘₯
Features: part of the model
[Slide figure: the nonlinear feature map φ(x) and the linear hypothesis y = w^T φ(x) together form the model]
Example: Polynomial kernel SVM
π‘₯1
𝑦 = sign(𝑀 𝑇 πœ™(π‘₯) + 𝑏)
π‘₯2
Fixed πœ™ π‘₯
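A minimal sketch in this spirit, using scikit-learn's polynomial-kernel SVM so that the classifier is linear in a fixed polynomial feature map φ(x) (the toy data and hyperparameters below are assumptions for illustration):

    import numpy as np
    from sklearn.svm import SVC

    # Toy 2-D data (x1, x2) that is not linearly separable in the original space
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)

    # Polynomial kernel: equivalent to sign(w^T phi(x) + b) for a fixed polynomial phi
    clf = SVC(kernel="poly", degree=3, coef0=1.0)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))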
Motivation: representation learning
• Why don't we also learn φ(x)?
[Slide figure: x → learn φ(x) → learn w → y = w^T φ(x)]
Feedforward networks
• View each dimension of φ(x) as something to be learned
[Slide figure: x → φ(x) → y = w^T φ(x)]
Feedforward networks
• Linear functions φ_i(x) = θ_i^T x don't work: y = w^T φ(x) would still be linear in x, so we need some nonlinearity
[Slide figure: x → φ(x) → y = w^T φ(x)]
Feedforward networks
• Typically, set φ_i(x) = r(θ_i^T x), where r(·) is some nonlinear function
[Slide figure: x → φ(x) → y = w^T φ(x)]
Feedforward deep networks
• What if we go deeper?
[Figure from Deep Learning, by Goodfellow, Bengio, Courville: x → h^1 → h^2 → ⋯ → h^L → y; dark boxes are things to be learned]
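A sketch of the deeper version, stacking several hidden layers h^1, ..., h^L before a linear output; the layer widths, initialization, and choice of nonlinearity are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    def deep_feedforward(x, weights, biases, r=np.tanh):
        """Compute h^l = r(W_l^T h^{l-1} + b_l) for each hidden layer, then a linear output."""
        h = x
        for W, b in zip(weights[:-1], biases[:-1]):
            h = r(W.T @ h + b)                   # hidden layers h^1, ..., h^L
        W_out, b_out = weights[-1], biases[-1]
        return W_out.T @ h + b_out               # linear output layer

    # Example: 3 hidden layers of width 16 on a 4-dimensional input, scalar output
    dims = [4, 16, 16, 16, 1]
    weights = [0.1 * rng.normal(size=(m, n)) for m, n in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(n) for n in dims[1:]]
    print(deep_feedforward(rng.normal(size=4), weights, biases))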
Motivation II: neurons
Motivation: neurons
[Figure from Wikipedia]
Motivation: abstract neuron model
• Neuron is activated when the correlation between the input and a pattern θ exceeds some threshold b
• y = threshold(θ^T x − b), or y = r(θ^T x − b)
• r(·) is called the activation function
[Slide figure: inputs x1, x2, ..., xd feeding into a single output y]
Motivation: artificial neural networks
• Put into layers: feedforward deep networks
[Slide figure: x → h^1 → h^2 → ⋯ → h^L → y]
Components in feedforward networks
Components
• Representations:
  • Input
  • Hidden variables
• Layers/weights:
  • Hidden layers
  • Output layer
Components
[Slide figure: input x → first layer → hidden variables h^1, h^2, ..., h^L → output layer → y]
Input
• Represented as a vector
• Sometimes requires some preprocessing, e.g.,
  • Subtract mean
  • Normalize to [−1, 1]
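A small sketch of these two preprocessing steps (the helper name and toy data are assumptions):

    import numpy as np

    def preprocess(X):
        """Subtract the per-feature mean, then scale each feature into [-1, 1]."""
        X = X - X.mean(axis=0)                    # subtract mean
        max_abs = np.abs(X).max(axis=0)
        return X / np.maximum(max_abs, 1e-12)     # normalize to [-1, 1]

    X = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 700.0]])
    print(preprocess(X))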
Output layers
• Regression: y = w^T h + b
• Linear units: no nonlinearity
[Slide figure: output layer mapping h to y]
Output layers
• Multi-dimensional regression: y = W^T h + b
• Linear units: no nonlinearity
[Slide figure: output layer mapping h to y]
Output layers
• Binary classification: y = σ(w^T h + b)
• Corresponds to using logistic regression on h
[Slide figure: output layer mapping h to y]
Output layers
• Multi-class classification:
  • y = softmax(z) where z = W^T h + b
  • Corresponds to using multi-class logistic regression on h
[Slide figure: h → z → y]
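A sketch putting the output layers above side by side, assuming a hidden representation h and illustrative weights:

    import numpy as np

    h = np.array([0.2, -1.0, 0.5])                 # last hidden representation

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        z = z - z.max()                            # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    w, b = np.array([1.0, 0.5, -2.0]), 0.1
    y_regression = w @ h + b                       # linear unit, no nonlinearity

    y_binary = sigmoid(w @ h + b)                  # logistic regression on h

    W, b_vec = 0.5 * np.ones((3, 4)), np.zeros(4)  # 4 classes
    y_multiclass = softmax(W.T @ h + b_vec)        # multi-class logistic regression on h

    print(y_regression, y_binary, y_multiclass)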
Hidden layers
• Each neuron takes a weighted linear combination of the previous layer
• So it can be thought of as outputting one value for the next layer
[Slide figure: layer h^i → layer h^{i+1}]
Hidden layers
• y = r(w^T x + b)
• Typical activation functions r:
  • Threshold: t(z) = 𝕀[z ≥ 0]
  • Sigmoid: σ(z) = 1/(1 + exp(−z))
  • Tanh: tanh(z) = 2σ(2z) − 1
[Slide figure: x → r(·) → y]
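A sketch of the three activations listed above, checking the tanh identity numerically (all names are illustrative):

    import numpy as np

    def threshold(z):
        return (z >= 0).astype(float)            # t(z) = 1[z >= 0]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))          # sigma(z) = 1 / (1 + exp(-z))

    def tanh_via_sigmoid(z):
        return 2.0 * sigmoid(2.0 * z) - 1.0      # tanh(z) = 2*sigma(2z) - 1

    z = np.linspace(-3, 3, 7)
    print(threshold(z))
    print(sigmoid(z))
    print(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))   # True: the identity holds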
Hidden layers
• Problem: saturation. When the input is far from zero, the gradient of sigmoid or tanh is too small.
[Figure borrowed from Pattern Recognition and Machine Learning, Bishop]
Hidden layers
• Activation function ReLU (rectified linear unit)
• ReLU(z) = max{z, 0}
• Gradient is 0 for z < 0 and 1 for z > 0
[Figure from Deep Learning, by Goodfellow, Bengio, Courville]
Hidden layers
• Generalizations of ReLU: gReLU(z) = max{z, 0} + α min{z, 0}
  • Leaky-ReLU(z) = max{z, 0} + 0.01 min{z, 0}
  • Parametric-ReLU(z): α learnable
[Slide figure: plot of gReLU(z) against z]
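A sketch of the ReLU family in this parameterization (function names are illustrative):

    import numpy as np

    def g_relu(z, alpha):
        """Generalized ReLU: max{z, 0} + alpha * min{z, 0}."""
        return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

    def relu(z):
        return g_relu(z, alpha=0.0)              # ReLU(z) = max{z, 0}

    def leaky_relu(z):
        return g_relu(z, alpha=0.01)             # small fixed slope for z < 0

    # Parametric ReLU uses the same formula with alpha treated as a learnable parameter;
    # here it is just a plain variable standing in for a trained value.
    alpha_learned = 0.2
    z = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu(z), leaky_relu(z), g_relu(z, alpha_learned))

Unlike plain ReLU, the leaky and parametric variants keep a nonzero gradient for z < 0, which avoids units that stop updating entirely.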