SeRA: Self-reviewing and alignment of LLMs using implicit reward margins
2025
Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives to Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the preferences used in DAAs are usually collected before alignment training begins and remain unchanged (off-policy). This design leads to two problems: the policy model (1) picks up spurious correlations in the dataset (rather than learning the intended alignment expressed in the human preference labels), and (2) overfits to feedback on off-policy trajectories that the updated policy model is unlikely to generate. To address these issues, we introduce Self-Reviewing and Alignment (SeRA), a cost-efficient and effective method that can be readily combined with existing DAAs. SeRA comprises two components: (1) sample selection using implicit reward margins, which helps alleviate overfitting to undesired features, and (2) preference bootstrapping using implicit rewards to augment preference data with updated policy models in a cost-efficient manner. Extensive experimentation, including experiments on instruction-following tasks, demonstrates the effectiveness and generality of SeRA in training LLMs on offline preference datasets with DAAs.
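As background, the sketch below illustrates what an implicit reward margin looks like for a DPO-style DAA, where the implicit reward is r̂(x, y) = β log(π_θ(y|x) / π_ref(y|x)) and the margin is r̂(x, y_w) − r̂(x, y_l) for a chosen/rejected response pair. The function names, the simple threshold-based selection rule, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
import torch


def implicit_reward_margin(policy_logp_w, policy_logp_l,
                           ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style implicit reward margin between chosen (w) and rejected (l) responses.

    r_hat(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x));
    the margin is r_hat(x, y_w) - r_hat(x, y_l).
    """
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    return reward_w - reward_l


def select_by_margin(margins, threshold):
    """Keep preference pairs whose implicit reward margin exceeds a threshold
    (a hypothetical selection rule for illustration; SeRA's exact criterion may differ)."""
    return [i for i, m in enumerate(margins.tolist()) if m > threshold]


if __name__ == "__main__":
    # Toy sequence-level log-probabilities for 4 preference pairs.
    policy_logp_w = torch.tensor([-12.0, -15.0, -9.0, -20.0])
    policy_logp_l = torch.tensor([-14.0, -14.5, -13.0, -19.5])
    ref_logp_w = torch.tensor([-13.0, -15.0, -10.0, -20.0])
    ref_logp_l = torch.tensor([-13.5, -14.0, -12.0, -19.0])

    margins = implicit_reward_margin(policy_logp_w, policy_logp_l,
                                     ref_logp_w, ref_logp_l)
    print("margins:", margins)
    print("selected pair indices:", select_by_margin(margins, threshold=0.1))
```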
Research areas