benchmark (2)

[2026-2] 염제원, 김학선 - AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models
"Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure..." (arxiv.org; dataset: ArtificialAnalysis/AA-Omniscience-Public)
2026. 2. 16.

[2025-2] 박지원 - Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Original: https://openreview.net/forum?id=WdA5H9ARaa#discussion
"Public benchmarks are compromised, as the training data for many Large Language Models (LLMs) is contaminated with test data, suggesting a performance gap between benchmark scores and actual..." (openreview.net)
Intro: Regarding the problem of inflated scores on LLM benchmark datasets, public benchmark datasets contaminate the training data...
2025. 9. 4.