Liangchen Luo*, Yuanhao Xiong*, Yan Liu, Xu Sun.
To appear in Proceedings of the 7th International Conference on Learning Representations, New Orleans, Louisiana. 2019.

## Background

Despite their popularity, the generalization ability and out-of-sample behavior of these adaptive methods are often worse than those of their non-adaptive counterparts. They typically display faster progress in the initial portion of training, but their performance quickly plateaus on unseen data (the development/test set) (Wilson et al., 2017). Indeed, Sgd is the optimizer of choice in several recent state-of-the-art works in NLP and CV, where it performs better than adaptive methods. Several recent studies have attempted to address this problem. For instance, in one of the ICLR 2018 best papers, Reddi et al. (2018) proposed a variant of Adam called AMSGrad that aims to fix the convergence issue. However, they only provided a theoretical proof of convergence, and other researchers have since found that a considerable performance gap still exists between AMSGrad and Sgd (Keskar & Socher, 2017; Chen et al., 2018).

All of us hope to train models faster and better. But it seems we cannot achieve this easily for now, or at least not without fine-tuning the optimizer's hyperparameters very carefully. An optimizer that is as fast as Adam and as good as Sgd is sorely needed by the community.

## Extreme Learning Rates Are Evil

The first attempt to systematically compare the performance of popular adaptive and non-adaptive optimizers was made by Wilson et al. (2017), who formally pointed out the performance issue of adaptive methods. The authors speculated that the poor generalization ability of adaptive methods may stem from unstable and extreme learning rates. However, they did not provide further theoretical analysis or empirical study of this hypothesis.

Along this direction, we conduct a preliminary experiment by sampling the learning rates of several weights and biases of a ResNet-34 trained on CIFAR-10 with Adam.

We find that when the model is close to convergence, the learning rates consist of tiny ones smaller than 0.01 as well as huge ones greater than 1000. This observation shows that there are indeed learning rates that are too large or too small in the final stage of the training process. But so far two doubts remain: (1) do the tiny learning rates really harm the convergence of Adam? (2) since the actual learning rate highly depends on the initial step size, can we use a relatively larger initial step size to get rid of the learning rates that are too small?
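To make the phenomenon concrete, here is a minimal NumPy sketch of Adam's effective per-parameter step size $\alpha / (\sqrt{\hat{v}_t} + \epsilon)$ (an illustration we wrote for this post, not the actual sampling code from the experiment). Extreme values arise naturally: a rarely active parameter accumulates a tiny second moment, while a consistently large gradient accumulates a huge one.

```python
import numpy as np

def adam_effective_lr(grads, alpha=0.001, beta2=0.999, eps=1e-8):
    """Effective per-parameter step size alpha / (sqrt(v_hat) + eps) after
    feeding Adam a stream of gradients (with bias correction)."""
    v = np.zeros_like(grads[0])
    v_hat = v
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = v / (1 - beta2 ** t)
    return alpha / (np.sqrt(v_hat) + eps)

# A parameter whose gradient is almost always zero keeps a tiny second
# moment, so its effective learning rate ends up far larger than alpha ...
rare = [np.array([0.0])] * 999 + [np.array([1e-4])]
print(adam_effective_lr(rare))   # far larger than alpha = 0.001

# ... while consistently large gradients drive it far below alpha.
dense = [np.array([100.0])] * 1000
print(adam_effective_lr(dense))  # far smaller than alpha = 0.001
```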

We thereby go further into the theoretical aspect and prove that:

Theorem 3. For any constant $\beta_1, \beta_2 \in [0, 1)$ such that $\beta_1 < \sqrt{\beta_2}$, there is a stochastic convex optimization problem where for any initial step size $\alpha$, Adam does not converge to the optimal solution.

This result illustrates the potentially bad impact of extreme learning rates: algorithms are unlikely to achieve good generalization ability without solving this problem. We need to somehow restrict the learning rates of Adam in the final stage of training.

## Applying Dynamic Bound on Learning Rates

To address the problem, we need to develop a new variant of optimization methods. Our aim is to devise a strategy that combines the benefits of adaptive methods, viz. fast initial progress, with the good final generalization properties of Sgd. Intuitively, we would like to construct an algorithm that behaves like an adaptive method early in training and like Sgd at the end.

We therefore propose variants of Adam and AMSGrad, called AdaBound and AMSBound, which apply a dynamic bound on learning rates. Inspired by gradient clipping, a popular technique that clips gradients larger than a threshold to avoid gradient explosion, we can also apply clipping to the learning rates in Adam. Consider the following operation in Adam: $$\mathrm{Clip}(\alpha / \sqrt{V_t}, \eta_l, \eta_u),$$ which clips the learning rate element-wise so that the output is constrained to lie between the lower bound $\eta_l$ and the upper bound $\eta_u$. It follows that Sgd(M) with step size $\alpha^*$ can be considered as the case where $\eta_l = \eta_u = \alpha^*$, while Adam corresponds to $\eta_l = 0$ and $\eta_u = \infty$.

To achieve a gradual transformation from Adam to Sgd, we employ $\eta_l$ and $\eta_u$ as functions of $t$ instead of constant values: $\eta_l(t)$ is a non-decreasing function that starts from $0$ at $t = 0$ and converges to $\alpha^*$ asymptotically, and $\eta_u(t)$ is a non-increasing function that starts from $\infty$ at $t = 0$ and also converges to $\alpha^*$ asymptotically. In this setting, AdaBound behaves just like Adam at the beginning, since the bounds have very little impact on learning rates, and it gradually transforms into Sgd(M) as the bounds become more and more restrictive. AMSBound is obtained by applying a similar bound to AMSGrad.
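The bounded update can be sketched in a few lines of NumPy. The bound schedules below are one hypothetical choice (any pair that converges to $\alpha^*$ from below and from above works; see the paper for the exact functions used in the experiments):

```python
import numpy as np

def bounds(t, alpha_star=0.1, gamma=1e-3):
    """One possible schedule: eta_l(t) rises from 0 toward alpha*, and
    eta_u(t) falls from (effectively) infinity toward alpha*, as t grows."""
    eta_l = alpha_star * (1.0 - 1.0 / (gamma * t + 1.0))
    eta_u = alpha_star * (1.0 + 1.0 / (gamma * t))
    return eta_l, eta_u

def adabound_step(param, m, v, t, alpha=1e-3, eps=1e-8, alpha_star=0.1):
    """One AdaBound update: compute Adam's per-parameter step size, clip it
    element-wise into [eta_l(t), eta_u(t)], then move along the momentum m."""
    eta_l, eta_u = bounds(t, alpha_star)
    step = np.clip(alpha / (np.sqrt(v) + eps), eta_l, eta_u)
    return param - step * m
```

Early in training (small $t$) the interval is so wide that the clip is a no-op and the update is plain Adam; as $t \to \infty$ both bounds approach $\alpha^*$ and the update degenerates to Sgd(M) with step size $\alpha^*$.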

## Experiment Results

We conduct an empirical study on different models to compare the new variants with popular methods including Sgd(M), AdaGrad, Adam, and AMSGrad. Here we only provide several learning curve figures as a direct, high-level illustration. Please check the paper for the details of the experimental settings and further analysis.

## Robust Performance

We also find an interesting property of AdaBound: it is not very sensitive to its hyperparameters, especially compared with Sgd(M). This result is very surprising and exciting, for it may translate into robust performance. We may spend much less time fine-tuning those pesky hyperparameters with this lovely new optimizer! In common cases, a default final learning rate of 0.1 achieves relatively good and stable results on unseen data.
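This insensitivity has a simple numerical intuition. Assuming bound schedules of the form $\eta_l(t) = \alpha^*(1 - 1/(\gamma t + 1))$ and $\eta_u(t) = \alpha^*(1 + 1/(\gamma t))$ (a hypothetical choice for illustration), the clipped step late in training is pinned near the final learning rate $\alpha^* = 0.1$ no matter which initial step size $\alpha$ we started from:

```python
import numpy as np

def bounded_step(t, alpha, alpha_star=0.1, gamma=1e-3, v=4.0, eps=1e-8):
    """Effective step after AdaBound's clip for a parameter whose Adam
    second-moment estimate is v."""
    eta_l = alpha_star * (1.0 - 1.0 / (gamma * t + 1.0))
    eta_u = alpha_star * (1.0 + 1.0 / (gamma * t))
    return float(np.clip(alpha / (np.sqrt(v) + eps), eta_l, eta_u))

# Late in training, three very different initial step sizes yield almost
# the same effective step, close to the final learning rate 0.1:
for alpha in (1e-4, 1e-3, 1e-2):
    print(bounded_step(t=10**6, alpha=alpha))
```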

Despite its robust performance, we still have to state that there is no silver bullet: using AdaBound does not mean you will be free from tuning hyperparameters. The performance of a model depends on many things, including the task, the model structure, the distribution of the data, etc. You still need to decide which hyperparameters to use based on your specific situation, but you will probably spend much less time on it than before!

## Code and Demo

We release the code of AdaBound on GitHub, built on PyTorch. You can install AdaBound easily via pip or by directly copying and pasting the source code into your project. Besides, thanks to the awesome work by the GitHub and Jupyter teams, Jupyter notebook (.ipynb) files render directly on GitHub, so we provide several notebooks (like this one) for better visualization. We hope these examples illustrate the robust performance of AdaBound.