Time series forecasting is a fundamental problem in machine learning, with applications in supply chain management, finance, healthcare, and many other domains. As an example, consider a large e-commerce retailer with a system that produces forecasts of the demand distribution for a set of products at a target time T. Using these forecasts as an input, the retailer can optimize buying and placement decisions. Accurate forecasts are important, but – perhaps less obviously – forecasts that do not exhibit excess volatility as the target date approaches also minimize costly effects in the supply chain.
Recent work applying deep learning to time series forecasting has focused primarily on recurrent and convolutional architectures; see Benidis et al. for a comprehensive overview. These are Seq2Seq architectures, consisting of an encoder that summarizes an input sequence into a fixed-length context vector and a decoder that produces an output sequence. Because real-world forecasting systems increasingly rely on neural networks, a need has arisen for black-box forecasting-system diagnostics. Initial work in this area considered the evolution of a sequence of forecasts for a single binary outcome in continuous time; more recent work extended this analysis to quantile forecasts. In this paper, we develop a notion called Bregman Volatility that quantifies the amount of forecast volatility beyond the forecast changes required to incorporate new information and improve accuracy. While Bregman Volatility can detect flaws in forecasts, how to incorporate this diagnostic into model design has been unexplored; existing multi-horizon forecasting architectures do not explicitly address excess forecast variation.
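To make the idea of "volatility beyond what accuracy improvement requires" concrete, the sketch below computes an illustrative excess-volatility diagnostic for a sequence of forecasts of a single target. It assumes squared-error loss, whose associated Bregman divergence is the squared difference; the function name, signature, and exact decomposition are illustrative assumptions, not the paper's definition of Bregman Volatility.

```python
import numpy as np

def excess_volatility(forecasts, outcome):
    """Illustrative excess-volatility diagnostic (not the paper's exact definition).

    Assumes squared-error loss, whose Bregman divergence is the squared
    difference, so D(a, b) = (a - b) ** 2.

    forecasts: successive forecasts f_1, ..., f_T for one outcome.
    outcome:   realized value y of the target.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    # Total forecast movement, measured by the Bregman divergence between
    # consecutive forecast revisions.
    movement = np.sum((forecasts[1:] - forecasts[:-1]) ** 2)
    # Accuracy improvement from the first forecast to the last.
    improvement = (forecasts[0] - outcome) ** 2 - (forecasts[-1] - outcome) ** 2
    # Movement not accounted for by accuracy improvement is "excess".
    return movement - improvement

# A forecast path that swings widely but ends where it started gains no
# accuracy, so all of its movement is excess.
print(excess_volatility([10.0, 14.0, 9.0, 10.0], outcome=10.0))
```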
Another limitation of many existing architectures is the information bottleneck that arises when the encoder transmits information to the decoder through a single hidden state. Attention mechanisms address this by allowing the decoder to take as input a weighted combination of relevant latent encoder states rather than a single context vector. Many variants have been proposed, including self-attention and dot-product attention, and transformer architectures (end-to-end attention with no recurrent layers) achieve state-of-the-art performance on NLP tasks. However, the absolute position encodings commonly used in the NLP literature cannot be applied directly to time series forecasting. Our work differs from prior work on relative position encodings in that we learn a representation from indicator variables of events relevant to the target application (e.g., holidays).
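The sketch below illustrates the two ideas in this paragraph: the decoder attends over all encoder states (avoiding the single-context bottleneck), and position information is derived from learned embeddings of event indicators (e.g., holiday flags) rather than absolute positions. The module name, shapes, and layer choices are assumptions for illustration and do not reproduce the MQTransformer architecture.

```python
import torch
import torch.nn as nn

class EventAwareAttention(nn.Module):
    """Sketch: dot-product attention whose position signal comes from learned
    embeddings of event indicators rather than absolute positions.
    (Hypothetical module, not the paper's implementation.)
    """

    def __init__(self, d_model: int, n_event_types: int):
        super().__init__()
        # Learned projection of binary event indicators; the summed event
        # embeddings play the role of a context-dependent position encoding.
        self.event_proj = nn.Linear(n_event_types, d_model, bias=False)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, decoder_queries, encoder_states, event_indicators):
        # decoder_queries:  (batch, horizon, d_model)
        # encoder_states:   (batch, history, d_model)
        # event_indicators: (batch, history, n_event_types) binary flags
        keys = encoder_states + self.event_proj(event_indicators.float())
        # The decoder takes a weighted combination of all encoder states,
        # rather than a single fixed-length context vector.
        out, _ = self.attn(decoder_queries, keys, encoder_states)
        return out

# Example usage with random tensors.
layer = EventAwareAttention(d_model=32, n_event_types=5)
q = torch.randn(2, 7, 32)                    # 7 future horizons
h = torch.randn(2, 28, 32)                   # 28 past encoder states
ev = torch.randint(0, 2, (2, 28, 5))         # event indicators for the history
print(layer(q, h, ev).shape)                 # torch.Size([2, 7, 32])
```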
MQTransformer: Context dependent attention and Bregman Volatility
2022