Training Deep Neural Networks (DNNs) with billions of parameters generally involves pipeline-parallel (PP) execution. Unfortunately, PP model training can use GPUs inefficiently, especially at large scale, due to idle GPU time caused by pipeline bubbles; this idle time is often 15–30%, and can exceed 60%, of the training job’s GPU allocation. To improve the GPU utilization of PP model training, this paper describes