Open Problem: The landscape of the loss surfaces of multilayer networks

Anna Choromanska (Courant Institute of Mathematical Sciences, NYU), Yann LeCun (Center for Data Science, NYU & Facebook AI Research), Gérard Ben Arous (Courant Institute of Mathematical Sciences, NYU)

Introduction: the challenge

Goal: understanding the loss function in deep learning.

Recent related works: Goodfellow et al., 2015; Choromanska et al., 2015; Dauphin et al., 2014; Saxe et al., 2014.

Questions:
- Why do multiple experiments with multilayer networks consistently yield very similar performance, despite the presence of many local minima?
- What is the role of saddle points in the optimization problem?
- Is the surface of the loss function of multilayer networks structured?

Multilayer network and spin-glass model

- Can we use spin-glass theory to explain the optimization paradigm of large multilayer networks?
- What assumptions need to be made?

Theoretical similarity: the loss function in deep learning

Consider a multilayer network of depth H with inputs X_1, ..., X_d, hidden layers of sizes n_1, n_2, ..., n_{H-1}, and scalar output Y:

Y = \frac{1}{\Lambda^{(H-1)/2}} \sum_{i=1}^{\Psi} X_i A_i \prod_{k=1}^{H} w_i^{(k)}, \qquad L(w) = \max(0, 1 - Y_t Y),    (1)

where
- Ψ is the number of input-output paths and Λ = Ψ^{1/H} (let Λ ∈ Z+),
- w_i^{(k)} is the weight of the k-th segment of the i-th path,
- A_i is a Bernoulli r.v. denoting path activation (0/1),
- L(w) is the hinge loss (Y_t: true data label (1/−1); w: weights).
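To make Equation (1) concrete, here is a minimal Python sketch that samples one forward pass of the path-based model and evaluates the hinge loss. The Gaussian inputs, the activation probability rho = 0.5, the sizes, and all variable names are illustrative assumptions added here, not part of the original formulation.

import numpy as np

rng = np.random.default_rng(0)

H = 3                # depth
Lam = 10             # Lambda: number of unique weights
Psi = Lam ** H       # Psi: number of input-output paths (Lambda = Psi**(1/H))
rho = 0.5            # assumed probability that any path is active

X = rng.standard_normal(Psi)        # Gaussian input attached to each path
A = rng.binomial(1, rho, size=Psi)  # Bernoulli path activations (0/1)
w = rng.standard_normal((Psi, H))   # w[i, k]: weight of the k-th segment of path i

# Equation (1): Y = Lambda^{-(H-1)/2} * sum_i X_i A_i prod_k w_i^{(k)}
Y = Lam ** (-(H - 1) / 2) * np.sum(X * A * np.prod(w, axis=1))

Y_t = 1                          # true label in {+1, -1}
L = max(0.0, 1.0 - Y_t * Y)      # hinge loss L(w)
print(f"Y = {Y:.4f}, hinge loss = {L:.4f}")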
Assumptions

Realistic assumptions:
- Model the max operator as a Bernoulli r.v. M (0/1).
- Assume each path is active with the same probability.
- Assume the network has Λ unique weights that are uniformly distributed over the network, i.e., every H-length product of weights appears in Equation (1) (the set of all products is \{w_{i_1} w_{i_2} \cdots w_{i_H}\}_{i_1, i_2, \ldots, i_H = 1}^{\Lambda}); this is justified by the redundancy of network parametrizations [Denil et al., 2013; Denton et al., 2014].
- Impose the spherical constraint \frac{1}{\Lambda} \sum_{i=1}^{\Lambda} w_i^2 = 1.
- Assume each X_i is a Gaussian r.v.

Unrealistic assumptions:
- Assume the activation mechanism of any path, A_i, is independent of the input X_i.
- Assume paths have independent input data.

The loss function in deep learning after the assumptions

\mathbb{E}[L(w)] \propto \frac{1}{\Lambda^{(H-1)/2}} \sum_{i_1, i_2, \ldots, i_H = 1}^{\Lambda} X_{i_1, i_2, \ldots, i_H} \, w_{i_1} w_{i_2} \cdots w_{i_H},

which is the Hamiltonian of the spherical spin-glass model.

Can we establish a stronger connection between the loss function of the deep model and the spherical spin-glass model by dropping the unrealistic assumptions?

Question: What happens for large-size models (Λ → ∞)? Answer: the landscape becomes structured.
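As a sanity check on the formula above, the following sketch evaluates the H-spin spherical spin-glass Hamiltonian for i.i.d. standard Gaussian couplings X_{i_1,...,i_H} and a random weight vector normalized to satisfy the spherical constraint. The sizes (H = 3, Λ = 20) and all names are illustrative, not taken from the source.

import itertools
import numpy as np

rng = np.random.default_rng(0)

H = 3     # spin-glass order (the network depth)
Lam = 20  # Lambda: number of spins (unique weights)

X = rng.standard_normal((Lam,) * H)     # couplings X_{i1,...,iH}
w = rng.standard_normal(Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)   # spherical constraint: (1/Lam) sum_i w_i^2 = 1

def hamiltonian(w, X, H, Lam):
    # Lambda^{-(H-1)/2} * sum over all (i1,...,iH) of X_{i1..iH} w_{i1} ... w_{iH}
    total = sum(X[idx] * np.prod(w[list(idx)])
                for idx in itertools.product(range(Lam), repeat=H))
    return total / Lam ** ((H - 1) / 2)

# For H = 3 this agrees with the vectorized np.einsum('ijk,i,j,k->', X, w, w, w)
print(hamiltonian(w, X, H, Lam))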
Empirical similarity: multilayer net vs. spherical spin-glass [Choromanska et al., 2015]

[Figure: histograms (counts vs. loss) of the loss values of the spherical spin-glass model for Λ ∈ {25, 50, 100, 250, 500} and of multilayer networks for n_hidden ∈ {25, 50, 100, 250, 500}, together with plots of the test loss and the test-loss variance as functions of n_hidden.]

Conjectures: spherical spin-glass versus deep network

Conjecture (deep learning): For large-size networks, most local minima are equivalent and yield similar performance on a test set.

Conjecture (deep learning): The probability of finding a "bad" (high-loss) local minimum is non-zero for small-size networks and decreases with network size.

Spherical spin-glass: Critical points form an ordered structure: there exists an energy barrier −ΛE_∞ (a certain value of the Hamiltonian) below which, with overwhelming probability, one can find only low-index critical points (they cannot be found above the barrier, as opposed to high-index critical points), and most of them are concentrated close to the barrier.

Conjecture (deep learning): Saddle points play a key role in the optimization problem in deep learning.

Spherical spin-glass: Above the energy −ΛE_∞, with overwhelming probability one can find only high-index saddle points, and there are exponentially many of those.

[Figure: mean number of critical points and mean number of low-index critical points (k = 0, 1, ..., 5) as functions of Λu, for H = 3 and Λ = 1000. Black line: u = −ΛE_0(H) (ground state); red line: u = −ΛE_∞(H) (energy barrier).]

Conjecture (deep learning): Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.

Table: Pearson correlation ρ between training and test loss for networks of different size (MNIST dataset).

n_1:  25      50      100     250     500
ρ:    0.7616  0.6861  0.5983  0.5302  0.4081

Spherical spin-glass: Recovering the ground state, i.e., the global minimum, takes an exponentially long time.

Open problem

Can we establish a stronger connection between the loss function of the deep model and the spherical spin-glass model by dropping the unrealistic assumptions?
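For readers who want to reproduce the flavor of the empirical comparison above, here is a minimal sketch of the multi-run experiment behind the first conjecture: train networks of several hidden-layer sizes from different random seeds and inspect the spread of the resulting test losses. It is only an approximation of the authors' setup, using scikit-learn's small digits dataset as a stand-in for MNIST; the hidden sizes, number of seeds, and hyperparameters are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n_hidden in (25, 50, 100, 250):
    test_losses = []
    for seed in range(5):  # each run ends in a (generally different) local minimum
        net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                            max_iter=300, random_state=seed)
        net.fit(X_tr, y_tr)
        test_losses.append(log_loss(y_te, net.predict_proba(X_te),
                                    labels=net.classes_))
    # The conjecture predicts that the spread across runs shrinks as n_hidden grows
    print(f"n_hidden={n_hidden:4d}  mean test loss={np.mean(test_losses):.4f}  "
          f"std={np.std(test_losses):.5f}")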