Abstract
This paper introduces an algorithm inspired by the work of Franceschi et al. (2017) for automatically tuning the learning rate while training neural networks. We formalize this problem as minimizing a given performance metric (e.g., validation error) at a future epoch using its “hyper-gradient” with respect to the learning rate at the current iteration. Such a hyper-gradient is difficult to estimate, and we discuss how approximations and Hessian-vector products allow us to develop a Real-Time method for Hyper-Parameter Optimization (RT-HPO). We present a comparison between RT-HPO and other popular HPO techniques and show that our approach performs better in terms of the final accuracy of the trained model. Online adaptation of the learning rate introduces two extra hyper-parameters, the initial value of the learning rate and the hyper-learning rate; our empirical results demonstrate that the accuracy obtained by RT-HPO is largely insensitive to these hyper-parameters.
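To make the idea concrete, below is a minimal sketch of online hyper-gradient adaptation of the learning rate under strong simplifying assumptions: a one-step look-ahead on the training loss of a toy quadratic objective, rather than a future-epoch validation metric, and without the approximations and Hessian-vector products that RT-HPO uses. The names (`lr`, `hyper_lr`, `grad`) are illustrative, not the paper's notation.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss L(w) = 0.5 * ||w||^2."""
    return w

w = np.array([5.0, -3.0])
lr = 0.01          # initial learning rate (one of the two extra hyper-parameters)
hyper_lr = 0.001   # hyper-learning rate (the other extra hyper-parameter)
prev_grad = np.zeros_like(w)

for step in range(100):
    g = grad(w)
    # One-step hyper-gradient of the loss w.r.t. the learning rate used at the
    # previous update: since w_t = w_{t-1} - lr * grad(w_{t-1}),
    # dL(w_t)/d(lr) = grad(w_t) . (-grad(w_{t-1})).
    hyper_grad = -np.dot(g, prev_grad)
    lr = max(lr - hyper_lr * hyper_grad, 0.0)  # online learning-rate update
    w = w - lr * g                             # ordinary gradient step
    prev_grad = g
```

The learning rate grows while successive gradients point in similar directions and shrinks when they oppose each other; RT-HPO extends this one-step view to a multi-epoch horizon on a validation metric.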