MARTHE: Scheduling the learning rate via online hypergradients
2020
We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization, aiming at good generalization. We describe the structure of the hypergradient, that is, the gradient of a validation error w.r.t. the learning rate schedule. Based on this, we introduce MARTHE, a novel online algorithm guided by cheap approximations of the hypergradient, which uses past information from the optimization trajectory to simulate future behaviour. It interpolates between two recent techniques, RTHO [Franceschi et al., 2017] and HD [Baydin et al., 2018], and produces learning rate schedules that are more stable, leading to models that generalize better.
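To make the notion of an online, hypergradient-guided learning rate concrete, the following is a minimal sketch of a one-step approximation on a toy quadratic problem. It is not the MARTHE algorithm itself: it looks only one step ahead and keeps no information from earlier in the trajectory. The function name `one_step_hypergradient_sgd`, the hyper-learning rate `beta`, the clipping of the learning rate at zero, and the toy losses are all assumptions made for illustration.

```python
import numpy as np

def one_step_hypergradient_sgd(w0, train_grad, val_grad, eta0=0.01, beta=1e-3, steps=200):
    """SGD whose learning rate is adapted online with a one-step hypergradient.

    Illustrative sketch only: after the update w_next = w - eta * g, the
    derivative of the validation error at w_next with respect to eta is
    -val_grad(w_next) . g, and eta takes a gradient step along it.
    """
    w = np.asarray(w0, dtype=float)
    eta = float(eta0)
    schedule = []
    for _ in range(steps):
        schedule.append(eta)                  # learning rate used at this step
        g = train_grad(w)                     # training gradient at current weights
        w_next = w - eta * g                  # plain gradient step
        hyper = -np.dot(val_grad(w_next), g)  # d E_val(w_next) / d eta
        eta = max(eta - beta * hyper, 0.0)    # assumption: clip eta at zero
        w = w_next
    return w, schedule

# Toy problem (assumed): quadratic training and validation losses
# with slightly different minimizers a and b.
a = np.array([1.0, 2.0])
b = np.array([1.1, 1.9])
train_grad = lambda w: w - a                  # gradient of 0.5 * ||w - a||^2
val_grad = lambda w: w - b                    # gradient of 0.5 * ||w - b||^2
val_loss = lambda w: 0.5 * float(np.sum((w - b) ** 2))

w_init = np.zeros(2)
w_final, schedule = one_step_hypergradient_sgd(w_init, train_grad, val_grad)
```

In this sketch the learning rate grows while the training and validation gradients agree in direction and shrinks when they disagree, so `schedule` records a learning rate schedule that is fitted during training rather than fixed in advance.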