Entropy-SGD
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and, under certain assumptions, use uniform stability to show improved generalization over SGD. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
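The nested-loop structure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the local-entropy objective couples the weights x to an auxiliary copy x' through a quadratic term gamma/2 * ||x - x'||^2, so that the outer update moves x toward a running average of the inner Langevin iterates. The function name `entropy_sgd`, the toy quadratic loss in the usage note, and all hyperparameter values (`lr`, `inner_lr`, `gamma`, `noise`, `alpha`) are assumptions chosen for illustration, and full gradients stand in for minibatch gradients.

```python
import numpy as np


def entropy_sgd(grad_f, x0, outer_steps=100, inner_steps=20,
                lr=0.1, inner_lr=0.1, gamma=1.0, noise=1e-3,
                alpha=0.75, seed=0):
    """Toy sketch of a two-loop, local-entropy-style optimizer.

    grad_f : callable returning the gradient of the loss at a point.
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(outer_steps):
        # Inner loop: Langevin dynamics on x', started at the current x.
        x_prime = x.copy()
        mu = x.copy()  # running average of the inner iterates
        for _ in range(inner_steps):
            # Gradient w.r.t. x' of f(x') + gamma/2 * ||x - x'||^2.
            g = grad_f(x_prime) - gamma * (x - x_prime)
            # Gradient step plus Gaussian noise (the Langevin term).
            x_prime = (x_prime - inner_lr * g
                       + np.sqrt(inner_lr) * noise
                       * rng.standard_normal(x.shape))
            # Exponential moving average of the inner iterates.
            mu = (1.0 - alpha) * mu + alpha * x_prime
        # Outer loop: step along the estimated local-entropy gradient,
        # taken here to be gamma * (x - mu).
        x = x - lr * gamma * (x - mu)
    return x
```

As a usage example, `entropy_sgd(lambda z: z, [3.0, -2.0])` runs the sketch on the quadratic loss f(z) = ||z||^2 / 2, whose gradient is z, and returns a point much closer to the minimizer at the origin than the starting point. On a convex quadratic there are no sharp or wide valleys to distinguish, so this only checks that the two loops are wired together correctly.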
References in zbMATH (referenced in 21 articles, 1 standard article)
Showing results 1 to 20 of 21.
- Baskerville, Nicholas P.; Keating, Jonathan P.; Mezzadri, Francesco; Najnudel, Joseph: A spin glass model for the loss surfaces of generative adversarial networks (2022)
- Heaton, Howard; Fung, Samy Wu; Lin, Alex Tong; Osher, Stanley; Yin, Wotao: Wasserstein-based projections with applications to inverse problems (2022)
- Rudin, Cynthia; Chen, Chaofan; Chen, Zhi; Huang, Haiyang; Semenova, Lesia; Zhong, Chudi: Interpretable machine learning: fundamental principles and 10 grand challenges (2022)
- Choudhury, Sayantan; Dutta, Ankan; Ray, Debisree: Chaos and complexity from quantum neural network. A study with diffusion metric in machine learning (2021)
- Cooper, Yaim: Global minima of overparameterized neural networks (2021)
- Darbon, Jérôme; Langlois, Gabriel P.: On Bayesian posterior mean estimators in imaging sciences and Hamilton-Jacobi partial differential equations (2021)
- Molitor, Denali; Needell, Deanna; Ward, Rachel: Bias of homotopic gradient descent for the hinge loss (2021)
- Pittorino, Fabrizio; Lucibello, Carlo; Feinauer, Christoph; Perugini, Gabriele; Baldassi, Carlo; Demyanenko, Elizaveta; Zecchina, Riccardo: Entropic gradient descent algorithms and wide flat minima (2021)
- Goldt, Sebastian; Advani, Madhu S.; Saxe, Andrew M.; Krzakala, Florent; Zdeborová, Lenka: Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup (2020)
- Liu, Hailiang; Markowich, Peter: Selection dynamics for deep neural networks (2020)
- Sun, Ruo-Yu: Optimization for deep learning: an overview (2020)
- Zhang, Linan; Schaeffer, Hayden: Forward stability of ResNet and its variants (2020)
- Aubin, Benjamin; Maillard, Antoine; Barbier, Jean; Krzakala, Florent; Macris, Nicolas; Zdeborová, Lenka: The committee machine: computational to statistical gaps in learning a two-layers neural network (2019)
- Baity-Jesi, Marco; Sagun, Levent; Geiger, Mario; Spigler, Stefano; Ben Arous, Gérard; Cammarota, Chiara; LeCun, Yann; Wyart, Matthieu; Biroli, Giulio: Comparing dynamics: deep neural networks versus glassy systems (2019)
- Chaudhari, Pratik; Choromanska, Anna; Soatto, Stefano; LeCun, Yann; Baldassi, Carlo; Borgs, Christian; Chayes, Jennifer; Sagun, Levent; Zecchina, Riccardo: Entropy-SGD: biasing gradient descent into wide valleys (2019)
- Chen, Yifan; Sun, Yuejiao; Yin, Wotao: Run-and-inspect method for nonconvex optimization and global optimality bounds for R-local minimizers (2019)
- Hill, Mitch; Nijkamp, Erik; Zhu, Song-Chun: Building a telescope to look into high-dimensional image spaces (2019)
- Kovachki, Nikola B.; Stuart, Andrew M.: Ensemble Kalman inversion: a derivative-free technique for machine learning tasks (2019)
- Achille, Alessandro; Soatto, Stefano: Emergence of invariance and disentanglement in deep representations (2018)
- Chaudhari, Pratik; Oberman, Adam; Osher, Stanley; Soatto, Stefano; Carlier, Guillaume: Deep relaxation: partial differential equations for optimizing deep neural networks (2018)