[2025-1] 김은서 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023)

kes305 2025. 2. 2. 14:08

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining s

arxiv.org

Introduction

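The conventional way to align an LLM is reinforcement learning (RL). Incorporating preference learning into the LLM makes it possible to build high-performing AI systems, and the most successful preference-learning method so far is RLHF. However, RLHF incurs significant computational cost, because it involves training multiple LMs and sampling from the LM policy during training. The paper therefore proposes a way to optimize an LM to match human preferences without explicit reward modeling and without RL. That method is DPO.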

RLHF

Step 1: Supervised fine-tuning (SFT)

Fine-tune a pretrained LM
Obtain the fine-tuned model $\pi^{SFT}$

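Step 2: Reward Modelling Phase

Given an input $x$, $y_{1}$ and $y_{2}$ are outputs generated by the SFT model
Humans express a preference between $y_{1}$ and $y_{2}$ (call the preferred answer $y_{w}$ and the dispreferred answer $y_{l}$)
The preferences are assumed to be generated by some latent reward model $r^*(x,y)$ that we do not have access to
Here the Bradley-Terry (BT) model is used to connect the reward $r$ to the observed preferences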
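
As a minimal sketch of the Bradley-Terry assumption above: the probability that $y_{1}$ is preferred over $y_{2}$ is $\exp(r(x,y_1)) / (\exp(r(x,y_1)) + \exp(r(x,y_2)))$, which equals the sigmoid of the reward difference. The function name and the scalar reward values below are hypothetical and chosen only for illustration; they do not come from the post.

```python
import math


def bt_preference_prob(reward_1: float, reward_2: float) -> float:
    """P(y_1 is preferred over y_2 | x) under Bradley-Terry, given scalar rewards."""
    # exp(r1) / (exp(r1) + exp(r2)) simplifies to sigmoid(r1 - r2).
    diff = reward_1 - reward_2
    # Numerically stable sigmoid: avoid exp() of a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Hypothetical rewards: r(x, y_1) = 2.0 and r(x, y_2) = 1.0.
p = bt_preference_prob(2.0, 1.0)
print(f"P(y_1 preferred over y_2) = {p:.3f}")
```

Only the difference between the two rewards matters, so adding the same constant to both rewards leaves the preference probability unchanged.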

The negative log-likelihood loss is as follows, and it is used to train the reward model.
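
As a numerical sketch of this loss, assuming the standard Bradley-Terry form used in the paper, $\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))]$: the helper names and the reward scores below are hypothetical, and a real reward model would produce these scores from a neural network rather than from hard-coded numbers.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_nll(reward_pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) pairs.

    r_w is the reward assigned to the preferred answer y_w,
    r_l is the reward assigned to the dispreferred answer y_l.
    """
    losses = [-log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs]
    return sum(losses) / len(losses)


# Hypothetical reward scores for three preference pairs (r_w, r_l).
pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.5)]
loss = reward_model_nll(pairs)
print(f"reward-model NLL = {loss:.4f}")
```

The loss shrinks as the reward model scores the preferred answer further above the dispreferred one, and it equals $\log 2$ for a pair the model cannot tell apart.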