
We provide a Playground for experiencing the world beyond the desk, and present a new lifestyle for moving from passive learning to a life of creation.

LLM (3)

[2026-1] 염제원, 김학선 - AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models. Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure.. (arxiv.org) ArtificialAnalysis/AA-Omniscience-Public · Data.. 2026. 2. 16.

[2025-2] 백승우 - Agent Learning via Early Experience. A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many envi.. (arxiv.org) 2025. 10. 15.

[2025-2] 박지원 - Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts. Original: https://openreview.net/forum?id=WdA5H9ARaa#discussion Public benchmarks are compromised, as the training data for many Large Language Models (LLMs) is contaminated with test data, suggesting a performance gap between benchmark scores and actual... (openreview.net) Intro - On the problem of score inflation on LLM benchmark datasets: public benchmark datasets, in the train data, .. 2025. 9. 4.

