[2025-1] 백승우 - Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a large

arxiv.org

1. Methods

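Train a small LM (reference model, 125M parameters) on a portion of the full dataset; it is used only to compute perplexity
- Prune the remaining data according to the reference model's perplexity, then train the LLM (final model, 1B/3B parameters) on what is kept
- The selection rate sets how much data is kept, and the kept subset is taken from the low-, medium-, or high-perplexity range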
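
The selection step above can be sketched roughly as follows. This is a minimal illustration and not the paper's code: the function names `perplexity` and `select_by_perplexity` are hypothetical, the per-token log-probabilities are assumed to come from the reference model, and "medium" is assumed here to mean a window centered on the median perplexity.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sample: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_by_perplexity(perplexities, selection_rate, criterion):
    """Return the indices of the samples to keep.

    perplexities:   one reference-model perplexity per sample
    selection_rate: fraction of samples to keep, e.g. 0.25
    criterion:      "low", "medium", or "high"
    """
    n = len(perplexities)
    k = max(1, round(n * selection_rate))
    # Sample indices ordered from lowest to highest perplexity.
    order = sorted(range(n), key=lambda i: perplexities[i])
    if criterion == "low":
        start = 0
    elif criterion == "high":
        start = n - k
    elif criterion == "medium":
        # Assumption: keep a window centered on the median.
        start = (n - k) // 2
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return sorted(order[start:start + k])


ppl = [5.0, 1.0, 3.0, 9.0, 7.0, 2.0, 8.0, 4.0]
kept_high = select_by_perplexity(ppl, 0.25, "high")
kept_medium = select_by_perplexity(ppl, 0.25, "medium")
```

The final model would then be trained only on the samples whose indices are returned.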

2. Experiments

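Performance improves substantially on three datasets, and the number of training steps needed also drops
Evaluation across several different dataset compositions is necessary, because the best criterion depends on the composition
- The Pile (15.6% web data): high-perplexity data performs best
- Dolma (81.31% web data): medium-perplexity data performs best
Performance also improves under over-training and data-constrained training
- Over-training: when training on more tokens than the optimal count (5x Chinchilla), the gain over the unpruned baseline is maintained or stays similar to the gain seen in standard training
- Data-constrained training: even when available data is scarce and must be repeated, better performance can be obtained from the limited data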
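
To make the two regimes concrete, here is a small arithmetic sketch. It assumes the common rule of thumb of roughly 20 training tokens per parameter for a Chinchilla-optimal budget; that constant, the function names, and the 25B unique-token figure are illustrative assumptions, not numbers taken from the post.

```python
# Assumption: rule-of-thumb Chinchilla-optimal ratio of about 20 tokens per parameter.
TOKENS_PER_PARAM = 20


def chinchilla_tokens(n_params, multiplier=1):
    """Approximate token budget for a model with n_params parameters."""
    return n_params * TOKENS_PER_PARAM * multiplier


def epochs_needed(token_budget, unique_tokens):
    """How many passes over the available data are needed to fill the budget."""
    return token_budget / unique_tokens


optimal = chinchilla_tokens(1_000_000_000)         # 1B-parameter final model
overtrained = chinchilla_tokens(1_000_000_000, 5)  # 5x Chinchilla
# Data-constrained case: hypothetical 25B unique tokens after pruning.
repeats = epochs_needed(overtrained, 25_000_000_000)
```

Under these assumptions a 1B model goes from about 20B tokens to about 100B tokens when over-trained, and a pruned set of 25B unique tokens would have to be repeated 4 times to fill that budget.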
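After data pruning, perplexity on the test set can go up, but this does not imply worse downstream task performance
- Upstream metrics (perplexity) alone are of limited use for judging how effective data pruning is
- For this reason, evaluation is done on downstream tasks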

3. Limitations

On datasets with little web data, high-perplexity data performs best, while on datasets with a lot of web data, medium-perplexity data performs best; no explanation is given for this difference
