Open Problem:
The landscape of the loss surfaces of multilayer networks

Anna Choromanska (1), Yann LeCun (2), Gérard Ben Arous (1)
(1) Courant Institute of Mathematical Sciences, NYU
(2) Center for Data Science, NYU & Facebook AI Research
Introduction
Challenge
Goal: Understanding the loss function in deep learning.
Recent related works: Goodfellow et al., 2015, Choromanska et
al., 2015, Dauphin et al., 2014, Saxe et al., 2014.
Questions:
Why do multiple experiments with multilayer networks
consistently give very similar performance, despite
the presence of many local minima?
What is the role of saddle points in the optimization problem?
Is the surface of the loss function of multilayer networks
structured?
Introduction
Multilayer network and spin-glass model
Can we use spin-glass theory to explain the optimization
paradigm of large multilayer networks?
What assumptions need to be made?
Theoretical similarity
Loss function in deep learning
[Figure: a multilayer network with inputs X_1, X_2, ..., X_d, hidden layers of sizes n_1, n_2, ..., n_{H-1}, and output Y.]

Y = \frac{1}{\Lambda^{(H-1)/2}} \sum_{i=1}^{\Psi} X_i A_i \prod_{k=1}^{H} w_i^{(k)},
\qquad L(w) = \max(0, 1 - Y_t Y),   (1)

\Psi - # input-output paths, \Lambda = \Psi^{1/H} (let \Lambda \in Z^+), H - depth
w_i^{(k)} - the weight of the k-th segment of the i-th path
A_i - Bernoulli r.v. denoting path activation (0/1)
L(w) - hinge loss (Y_t: true data label (1/-1), w: weights)
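A minimal numerical sketch of Equation 1, assuming i.i.d. Gaussian inputs X_i, fair-coin path activations A_i, and standard-normal segment weights; the sizes and distributions here are illustrative choices, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

H = 3                                # depth
Lam = 10                             # Lambda; number of paths is Psi = Lam**H
Psi = Lam ** H

X = rng.standard_normal(Psi)         # X_i: Gaussian input fed to path i
A = rng.binomial(1, 0.5, size=Psi)   # A_i: Bernoulli path activations (0/1)
W = rng.standard_normal((Psi, H))    # W[i, k]: weight of the k-th segment of path i

# Equation (1): output as a normalized sum over all input-output paths
Y = (X * A * W.prod(axis=1)).sum() / Lam ** ((H - 1) / 2)

Y_t = 1.0                            # true label in {+1, -1}
loss = max(0.0, 1.0 - Y_t * Y)       # hinge loss L(w)
print(Y, loss)
```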
Theoretical similarity
Assumptions
Realistic assumptions:
Model the max operator as a Bernoulli r.v. M (0/1).
Assume each path is active with the same probability.
Assume \Lambda weights that are uniformly distributed in the
network, i.e. every H-length product of weights appears in
Equation 1 (the set of all products is
\{w_{i_1} w_{i_2} \dots w_{i_H}\}_{i_1, i_2, \dots, i_H = 1}^{\Lambda}); network parametrization is
redundant [Denil et al., 2013, Denton et al., 2014].
Impose the spherical constraint
\frac{1}{\Lambda} \sum_{i=1}^{\Lambda} w_i^2 = 1.
Assume each X_i is a Gaussian r.v.
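As a small illustration of the spherical constraint, a sketch that rescales an arbitrary weight vector so that (1/\Lambda) \sum_i w_i^2 = 1; the helper name `project_to_sphere` is my own, not from the slides:

```python
import numpy as np

def project_to_sphere(w):
    """Rescale w so that (1/Lambda) * sum(w_i ** 2) = 1,
    i.e. w lies on the sphere of radius sqrt(Lambda)."""
    Lam = w.size
    return w * np.sqrt(Lam) / np.linalg.norm(w)

w = np.random.default_rng(1).standard_normal(1000)
w = project_to_sphere(w)
assert np.isclose((w ** 2).mean(), 1.0)
```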
Theoretical similarity
Assumptions
Unrealistic assumptions:
Assume the activation mechanism M A_i of any path is
independent of the input X_i.
Assume paths have independent input data.
Theoretical similarity
Loss function in deep learning after assumptions

E[L(w)] \propto \underbrace{\frac{1}{\Lambda^{(H-1)/2}} \sum_{i_1, i_2, \dots, i_H = 1}^{\Lambda} X_{i_1, i_2, \dots, i_H}\, w_{i_1} w_{i_2} \cdots w_{i_H}}_{\text{Hamiltonian of the spherical spin-glass model}}

Can we establish a stronger connection between the loss
function of the deep model and the spherical spin-glass
model by dropping the unrealistic assumptions?
Question: What happens for large-size models (\Lambda \to \infty)?
Answer: The landscape becomes structured.
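A hedged sketch of this Hamiltonian evaluated as a tensor contraction; the function name and the sizes \Lambda = 50, H = 3 are illustrative assumptions:

```python
import numpy as np

def spin_glass_hamiltonian(w, X):
    """(1/Lam**((H-1)/2)) * sum over i_1..i_H of
    X[i_1, ..., i_H] * w[i_1] * ... * w[i_H], with H = X.ndim."""
    Lam, H = w.size, X.ndim
    val = X
    for _ in range(H):
        val = val @ w                     # contract one index at a time
    return val / Lam ** ((H - 1) / 2)

rng = np.random.default_rng(2)
Lam, H = 50, 3
X = rng.standard_normal((Lam,) * H)       # i.i.d. Gaussian couplings X_{i_1..i_H}
w = rng.standard_normal(Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)     # spherical constraint (1/Lam) sum w_i^2 = 1
print(spin_glass_hamiltonian(w, X))
```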
Empirical similarity
Multilayer net vs spherical spin-glass [Chorom. et al., 2015]
[Figure: loss histograms for the spin-glass model over \Lambda \in {25, 50, 100, 250, 500} and for the multilayer network over nhidden \in {25, 50, 100, 250, 500}, together with mean test loss and test loss variance as functions of nhidden.]
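One hypothetical way to generate the spin-glass side of such a comparison is projected gradient descent on the sphere; everything below (function names, step counts, sizes) is an illustrative sketch, not the authors' experimental setup:

```python
import numpy as np

def hamiltonian(w, X):
    Lam, H = w.size, X.ndim
    val = X
    for _ in range(H):
        val = val @ w                          # contract one index at a time
    return val / Lam ** ((H - 1) / 2)

def grad_hamiltonian(w, X):
    """Gradient of the multilinear form: for each slot of X,
    contract w into all the other slots."""
    Lam, H = w.size, X.ndim
    g = np.zeros_like(w)
    for a in range(H):
        val = np.moveaxis(X, a, 0)             # keep slot a free
        for _ in range(H - 1):
            val = val @ w
        g += val
    return g / Lam ** ((H - 1) / 2)

def descend(Lam=50, H=3, steps=500, lr=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((Lam,) * H)        # fresh Gaussian couplings per run
    w = rng.standard_normal(Lam)
    w *= np.sqrt(Lam) / np.linalg.norm(w)
    for _ in range(steps):
        w -= lr * grad_hamiltonian(w, X)
        w *= np.sqrt(Lam) / np.linalg.norm(w)  # project back onto the sphere
    return hamiltonian(w, X)

# Histogram these endpoint energies over many runs to mimic the loss histograms.
energies = [descend(seed=s) for s in range(20)]
```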
Conjectures
Spherical spin-glass versus deep network
Conjecture (Deep learning)
For large-size networks, most local minima are equivalent and yield
similar performance on a test set.
Conjecture (Deep learning)
The probability of finding a “bad” (high value of loss) local
minimum is non-zero for small-size networks and decreases with
network size.
Spherical spin-glass
Critical points form an ordered structure: there exists an
energy barrier -\Lambda E_\infty (a certain value of the Hamiltonian) below
which, with overwhelming probability, one can find only low-index
critical points (and, unlike high-index critical points, they
cannot be found above the barrier), most of which are
concentrated close to the barrier.
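For reference, in the H-spin spherical model studied by Auffinger, Ben Arous, and Černý (2013), the barrier has a closed form; this formula is supplied from that literature and does not appear on the slides:

```latex
% Threshold energy of the H-spin spherical spin-glass
% (Auffinger, Ben Arous, Cerny, 2013):
E_\infty(H) = 2\sqrt{\frac{H-1}{H}}
% Low-index critical points concentrate in the band of energies
% between -\Lambda E_0(H) (ground state) and -\Lambda E_\infty(H).
```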
Conjectures
Spherical spin-glass versus deep network
Conjecture (Deep learning)
Saddle points play a key role in the optimization problem in deep
learning.
Spherical spin-glass
With overwhelming probability, one can find only high-index saddle
points above the energy -\Lambda E_\infty, and there are exponentially many of
those.

[Figure: mean number of critical points (left) and mean number of low-index critical points, k = 0, ..., 5 (right), as functions of \Lambda u; H = 3 and \Lambda = 1000. Black line: u = -\Lambda E_0(H) (ground state); red line: u = -\Lambda E_\infty(H) (energy barrier).]
Conjectures
Spherical spin-glass versus deep network
Conjecture (Deep learning)
Struggling to find the global minimum on the training set (as
opposed to one of the many good local ones) is not useful in
practice and may lead to overfitting.
n1    25      50      100     250     500
ρ     0.7616  0.6861  0.5983  0.5302  0.4081

Table: Pearson correlation between training and test loss for networks of
different size. MNIST dataset.
Spherical spin-glass
Recovering the ground state, i.e. the global minimum, takes an
exponentially long time.
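The ρ values in the table above are plain Pearson correlations; a minimal sketch with made-up per-run losses (the numbers below are illustrative only, not from the experiments):

```python
import numpy as np

# Hypothetical final losses from several training runs at one network size.
train_loss = np.array([0.061, 0.072, 0.058, 0.069, 0.065])
test_loss  = np.array([0.093, 0.105, 0.090, 0.101, 0.098])

# Pearson correlation between training and test loss across runs.
rho = np.corrcoef(train_loss, test_loss)[0, 1]
print(rho)
```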
Question
Open problem
Can we establish a stronger connection between the loss
function of the deep model and the spherical spin-glass
model by dropping the unrealistic assumptions?