All posts (263)

[2025-1] 김학선 - Secrets of RLHF in Large Language Models Part I: PPO
https://arxiv.org/abs/2307.04964
Abstract: Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as a human-centric (helpful, honest, and harmless) assistant. The goal of LLMs is to serve as a human-centric assistant… 2025. 2. 2.

[2025-1] 김은서 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023)
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining… 2025. 2. 2.

[2025-1] 계진혁 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Paper link: https://arxiv.org/abs/2305.18290
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Introduction and core summary of the paper… 2025. 2. 1.

[2025-1] PPO (proximal policy optimization)
https://www.youtube.com/watch?v=cIyXYYdZIsk&ab_channel=%ED%98%81%ED%8E%9C%ED%95%98%EC%9E%84%7CAI%26%EB%94%A5%EB%9F%AC%EB%8B%9D%EA%B0%95%EC%9D%98
The policy gradient approximation can be written as follows. Earlier algorithms did not reuse the old policy: they rolled out a trajectory, updated, rolled out another trajectory, and updated again. PPO instead reuses samples from the old policy to improve sample efficiency. When the old policy is used, the policy gradient expression above changes to… 2025. 2. 1.
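The PPO excerpt above describes reusing trajectories from the old policy via an importance ratio. As a minimal sketch (not taken from the linked post, and assuming per-step log-probabilities and advantages are already computed), the clipped surrogate objective from the PPO paper can be written as:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s) is the importance weight that
    lets PPO reuse trajectories collected under the old policy;
    clipping the ratio keeps each update conservative.
    """
    ratio = np.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))             # pessimistic bound

# Toy check: when the new and old policies coincide, every ratio is 1,
# so the objective reduces to the mean advantage.
logp = np.log(np.array([0.5, 0.25, 0.25]))
adv = np.array([1.0, -0.5, 0.2])
obj = ppo_clipped_objective(logp, logp, adv)
```

In a real training loop, `logp_new` would be recomputed under the current parameters at every gradient step while `logp_old` stays fixed for the whole batch, which is exactly the sample reuse the excerpt refers to.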