Bifurcated attention for single-context large-batch sampling

Ben Athiwaratkun; Sujan Gonugondla; Sanjay Krishna Gouda; Hantian Ding; Qing Sun; Jun Wang; Jiacheng Guo; Liangfu Chen; Haifeng Qian; Parminder Bhatia; Ramesh Nallapati; Sudipta Sengupta; Bing Xiang

2024

We present bifurcated attention, a method for language model inference in single-context batch sampling, where many completions are sampled from one shared prompt. The approach reduces redundant memory IO, a significant contributor to latency at high batch sizes and long context lengths. It does so by splitting the attention computation during incremental decoding into two separate GEMM operations: one over the KV cache from prefill and one over the KV cache produced during decoding. The computation remains exact and keeps the same computational load (FLOPs) as standard attention, but with reduced memory IO. Bifurcated attention is also compatible with multi-query attention, a mechanism known for reducing the memory IO of the KV cache, which further enables higher batch sizes and context lengths. The resulting efficiency lowers latency and improves suitability for real-time applications, for example by enabling massively parallel answer generation without substantially increasing latency, which enhances performance when combined with postprocessing techniques such as reranking.
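
The split described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tensor layouts (`q` as batch × heads × head_dim, a single shared context cache without a batch axis, and a per-sample decoding cache), and the single-step decoding setup are all assumptions made for illustration. The sketch compares a baseline that copies the shared context once per sample with a bifurcated version that contracts against the shared context cache directly, and the two should agree numerically.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def standard_attention(q, k_ctx, v_ctx, k_dec, v_dec):
    """Baseline: the shared context is replicated for every sample.

    Assumed (hypothetical) shapes:
      q:            (b, h, d)       one query token per sample
      k_ctx, v_ctx: (m_c, h, d)     prefill cache, shared by all samples
      k_dec, v_dec: (b, m_d, h, d)  per-sample cache from decoding steps
    """
    b, _, d = q.shape
    # Expand the context to one copy per sample, as a batched cache would.
    k_full = np.concatenate(
        [np.broadcast_to(k_ctx, (b,) + k_ctx.shape), k_dec], axis=1)
    v_full = np.concatenate(
        [np.broadcast_to(v_ctx, (b,) + v_ctx.shape), v_dec], axis=1)
    logits = np.einsum("bhd,bmhd->bhm", q, k_full) / np.sqrt(d)
    weights = softmax(logits, axis=-1)
    return np.einsum("bhm,bmhd->bhd", weights, v_full)


def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec):
    """Same result, but the context part never gets a batch axis."""
    d = q.shape[-1]
    m_c = k_ctx.shape[0]
    # Part 1: queries against the shared prefill cache (no batch axis on K).
    logits_ctx = np.einsum("bhd,mhd->bhm", q, k_ctx)
    # Part 2: queries against the per-sample decoding cache.
    logits_dec = np.einsum("bhd,bmhd->bhm", q, k_dec)
    # A single softmax over the joined logits keeps the computation exact.
    logits = np.concatenate([logits_ctx, logits_dec], axis=-1) / np.sqrt(d)
    weights = softmax(logits, axis=-1)
    w_ctx, w_dec = weights[..., :m_c], weights[..., m_c:]
    # The value aggregation is split the same way and then summed.
    out_ctx = np.einsum("bhm,mhd->bhd", w_ctx, v_ctx)
    out_dec = np.einsum("bhm,bmhd->bhd", w_dec, v_dec)
    return out_ctx + out_dec
```

In this sketch both functions perform the same multiply-adds; the difference is that the bifurcated version reads `k_ctx` and `v_ctx` as a single shared array rather than as one copy per sample, which is the memory IO saving the abstract refers to. NumPy itself does not model GPU memory traffic, so the example demonstrates numerical equivalence only, not the latency gain.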
