2025-04-21 Papers

Paper 1

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Published: 2025-04-17

Link: http://arxiv.org/pdf/2504.13161

1. 📘 Topic and Domain: The paper introduces CLIMB, a framework for optimizing data mixtures for language model pre-training through clustering-based iterative bootstrapping.
2. 💡 Previous Research and New Ideas: The paper builds on previous data mixture approaches but proposes a novel method to automatically identify, evaluate, and refine data mixtures without relying on predefined domain labels.
3. ❓ Problem: The paper aims to solve the challenge of finding optimal pre-training data mixtures for language models when working with large-scale web datasets that lack inherent domain divisions.
4. 🛠️ Methods: The authors cluster documents in semantic space, then iteratively optimize mixture weights using a bootstrapping process with proxy models and predictors to progressively refine the data mixture.
5. 📊 Results and Evaluation: Using the optimal data mixture, their 1B model exceeded the state-of-the-art Llama-3.2-1B by 2.0% on reasoning tasks, and domain-specific mixture optimization yielded a 5% improvement over random sampling; the authors also released the ClimbLab and ClimbMix datasets.
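The clustering step in the methods above can be sketched with a toy, numpy-only k-means followed by size-based pruning. This is an illustrative stand-in, not the paper's implementation: the embedding model, cluster count, and the "low-quality" criterion (here just cluster size) are all assumptions.

```python
import numpy as np

def cluster_and_prune(embeddings, k_init=8, min_size=5, n_iters=20, seed=0):
    """Toy sketch of CLIMB's Phase 1: k-means in embedding space, then
    prune clusters that are too small (a stand-in for quality pruning)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen data points
    centroids = embeddings[rng.choice(len(embeddings), size=k_init,
                                      replace=False)].copy()
    for _ in range(n_iters):
        # assign each document embedding to its nearest centroid
        d = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :],
                           axis=-1)
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster empties
        for c in range(k_init):
            members = embeddings[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # "prune low-quality": here, drop clusters with too few members
    keep = [c for c in range(k_init) if (labels == c).sum() >= min_size]
    return labels, keep
```

The real pipeline would also merge highly similar clusters; that step is omitted here for brevity.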

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

CLIMB Methodology Flowchart

Phase 1: Data Preprocessing & Clustering
1. Embed texts: large raw dataset (D̂) -> embedding vectors (E).
2. Cluster embeddings: k-means on E -> initial clusters (K_init).
3. Refine clusters: prune low-quality clusters (K_pruned) and merge similar ones (K_enhanced) -> final set of data clusters (D).

Phase 2: Iterative Mixture Bootstrapping (K iterations, k = 1 to K)
1. Sample/select mixtures (α): random for k = 1; guided by predictor f_{k-1} for k > 1.
2. Train proxy models on the sampled mixtures -> get performance ℓ(α, ω*).
3. Update the evaluated set S_k: combine the previous S_{k-1} with the new (α, performance) pairs.
4. Train/update the predictor f_k (e.g., LightGBM regression) on all data in S_k; the predictor guides the next iteration's sampling.

After K iterations, use the final predictor f_K to identify the optimal mixture (α*).
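The iterative bootstrapping phase can be sketched as a search loop: sample mixtures, score them with a black-box proxy, fit a predictor, and let the predictor steer the next round's sampling. Everything here is illustrative: `proxy_loss` stands in for training a small proxy model on a mixture and measuring its loss, and the paper's LightGBM regressor is replaced by least-squares on quadratic features.

```python
import numpy as np

def climb_bootstrap(proxy_loss, n_clusters=4, n_rounds=3, n_per_round=16,
                    seed=0):
    """Minimal sketch of CLIMB's Phase 2 mixture bootstrapping."""
    rng = np.random.default_rng(seed)

    def feats(a):
        # quadratic features so the predictor can model cluster interactions
        return np.concatenate([a, np.outer(a, a)[np.triu_indices(len(a))]])

    S_alpha, S_loss = [], []
    for k in range(n_rounds):
        if k == 0:
            # first round: random mixtures from a Dirichlet prior
            cands = rng.dirichlet(np.ones(n_clusters), size=n_per_round)
        else:
            # later rounds: over-sample, keep the predictor's favorites
            pool = rng.dirichlet(np.ones(n_clusters), size=20 * n_per_round)
            pred = np.stack([feats(a) for a in pool]) @ w
            cands = pool[np.argsort(pred)[:n_per_round]]
        for a in cands:
            S_alpha.append(a)
            S_loss.append(proxy_loss(a))  # "train proxy model" stand-in
        # refit the predictor on all evaluated (mixture, loss) pairs
        X = np.stack([feats(a) for a in S_alpha])
        w, *_ = np.linalg.lstsq(X, np.array(S_loss), rcond=None)
    # report the best mixture actually evaluated
    return S_alpha[int(np.argmin(S_loss))]
```

Each round narrows the search toward mixtures the predictor expects to perform well, mirroring the flowchart's "predictor guides next iteration's sampling" arrow.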
Q1
1. What is the main innovation of CLIMB compared to previous data mixture methods?
It uses larger proxy models to evaluate data quality
It automatically identifies and optimizes data mixtures without relying on predefined domain labels
It focuses exclusively on reasoning tasks rather than general capabilities
Q2
2. In the CLIMB framework, what is the purpose of the iterative bootstrapping process?
To train increasingly larger language models at each iteration
To gradually filter out low-quality web content
To progressively refine the search space and eliminate suboptimal data mixture candidates
Q3
3. What was a key finding from the ablation studies on CLIMB?
Using a 62M proxy model performed better than using a 350M proxy model
More search iterations improved performance, but compute should be balanced between depth and breadth
Random initialization consistently outperformed Dirichlet initialization

Paper 2

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Published: 2025-04-18

Link: http://arxiv.org/pdf/2504.13837

1. 📘 Topic and Domain: The paper examines whether reinforcement learning (RL) actually creates new reasoning capabilities in large language models (LLMs) beyond what exists in base models, focusing on mathematical, programming, and visual reasoning tasks.
2. 💡 Previous Research and New Ideas: The paper builds on previous research in Reinforcement Learning with Verifiable Rewards (RLVR) but challenges the common belief that RLVR enables LLMs to develop novel reasoning abilities beyond their base models.
3. ❓ Problem: The paper aims to determine whether RLVR training genuinely introduces new reasoning capabilities to LLMs or merely optimizes existing capabilities from the base model.
4. 🛠️ Methods: The authors used the pass@k metric with large k values across multiple model families and benchmarks to measure the reasoning capability boundaries of both base and RL-trained models, combined with perplexity analysis.
5. 📊 Results and Evaluation: The results showed that while RL-trained models outperform base models at small k values, base models achieve higher pass@k scores at large k values, indicating that RLVR improves sampling efficiency but does not introduce new reasoning abilities beyond what already exists in the base models.

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Paper Workflow: Re-examining RLVR's Impact on Reasoning

Core question: does RLVR truly add NEW reasoning capabilities beyond the base model?
Challenge: traditional metrics (e.g., pass@1) show average performance, not capability limits.
Proposed method: evaluate the reasoning boundary using pass@k with LARGE k. Rationale: if the base model solves a problem given enough samples (large k), the capability already exists, just sampled less efficiently.

Experimental setup: compare base vs. RLVR models.
- Models: Qwen-2.5 (7B, 14B, 32B), LLaMA-3.1-8B, Qwen-2.5-VL-7B.
- Tasks & benchmarks: math (GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23); code (LiveCodeBench, HumanEval+, MBPP+); visual reasoning (MathVista and MathVision, both filtered).
- RL approach: primarily "zero RL" (RL directly on the base model); algorithms: GRPO (main), PPO and others in later analysis.
- Evaluation: zero-shot prompts, T = 0.6, top-p = 0.95, large k (e.g., 256, 1024).
- Key measurement: plot pass@k curves for base vs. RLVR models.

Deep analysis to understand the observed pass@k trends:
1. CoT validity check (math/visual). Problem: could a correct answer come from wrong reasoning? Method: filter guessable problems (AIME24); manually inspect CoTs for the hardest problems solved at large k.
2. Coverage & perplexity analysis. Method 1 (solvable-set comparison): check whether {problems solved by RL} ⊆ {problems solved by Base} at large k. Method 2 (perplexity): compare PPL_Base(Y_RL) vs. PPL_Base(Y_Base).
3. Comparison with distillation. Question: does distillation behave differently? Method: compare pass@k curves of the base, RL, and distilled models (e.g., DeepSeek-R1 distilled into Qwen).
4. RL algorithm & training-step analysis. Method 1 (algorithm comparison): use the VeRL framework for fair comparison (PPO, GRPO, RLOO, ...); define the Sampling Efficiency Gap ΔSE = pass@k(Base) - pass@1(RL); evaluate on Omni-MATH splits. Method 2 (training steps): track pass@1 and pass@k (large k) versus training steps.

Connecting the analyses to the research question:
- Does large-k base performance match or exceed the RL model's pass@k?
- Are RL solutions already likely under the base model (perplexity)?
- Is the set of RL-solvable problems a subset of base-solvable ones (coverage)?
- Does distillation show a different boundary expansion (distillation)?
- How close do RL algorithms get to the base model's boundary (ΔSE)?
- Does longer RL training shrink the boundary (training steps)?
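The solvable-set comparison in the coverage analysis amounts to a subset check over per-problem outcomes. A trivial sketch (problem IDs are hypothetical placeholders):

```python
def coverage_check(base_solved, rl_solved):
    """Solvable-set comparison: at large k, is every problem the RLVR model
    solves also solved by the base model? Returns the subset flag plus any
    counterexamples (problems only the RL model solves)."""
    extra = set(rl_solved) - set(base_solved)
    return len(extra) == 0, extra
```

An empty counterexample set supports the paper's claim that RLVR stays within the base model's reasoning boundary.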
Q1
1. According to the paper, what is the primary effect of Reinforcement Learning with Verifiable Rewards (RLVR) on LLMs?
It creates entirely new reasoning capabilities beyond what exists in the base model
It improves sampling efficiency by biasing the model toward rewarded reasoning paths
It increases the model's ability to explore novel reasoning patterns
Q2
2. What surprising phenomenon did the researchers observe when comparing base models to RL-trained models at large k values?
Base models consistently outperformed their RL-trained counterparts
Both models performed equally well regardless of k value
RL-trained models showed exponential improvement as k increased
Q3
3. How does the paper distinguish between the effects of RLVR and distillation on LLM reasoning capabilities?
RLVR and distillation both reduce the model's reasoning boundary in similar ways
RLVR is bounded by the base model's capabilities, while distillation can genuinely introduce new knowledge
Distillation improves sampling efficiency while RLVR expands reasoning patterns

Paper 3

Antidistillation Sampling

Published: 2025-04-17

Link: http://arxiv.org/pdf/2504.13146

1. 📘 Topic and Domain: The paper introduces "Antidistillation Sampling," a technique in AI security that prevents language models from being effectively distilled while maintaining their functionality.
2. 💡 Previous Research and New Ideas: The paper builds on prior work in model distillation and data poisoning, proposing a novel approach to strategically modify a model's token probability distributions to resist distillation.
3. ❓ Problem: The paper aims to solve the problem of protecting proprietary large language models from being easily distilled by competitors who could use the models' reasoning traces to train their own systems at much lower cost.
4. 🛠️ Methods: The authors use a gradient-based approach that modifies the sampling distribution by adding a penalty term based on a directional derivative capturing how token choices would impact a distilled model's performance, implemented efficiently using a finite-difference approximation.
5. 📊 Results and Evaluation: Results show that antidistillation sampling sharply reduces student model performance (24.73% on GSM8K, vs. 51.86% when distilling from temperature-sampled traces) while maintaining comparable teacher accuracy (68.51% vs. 68.90%), demonstrating effective protection against distillation attempts.

Antidistillation Sampling

Antidistillation Sampling Workflow (Methodology Flowchart)

Goal: modify sampling to achieve (1) non-distillability (poison student training) and (2) nominal utility (maintain teacher performance).

Phase 1: Initialization (once). Define the models: teacher (θ_T) and proxy (θ_P). Define a downstream loss ℓ (e.g., NLL on a benchmark). Compute the loss gradient g ← ∇ℓ(θ_P); store g and θ_P + εg.

Phase 2: Token generation loop (for each token t; input: current sequence x_{1:t}).
1. Get teacher log-probs: log p(· | x_{1:t}; θ_T).
2. Compute the approximate penalty Δ̂: get proxy log-probs P_orig = log p(· | x_{1:t}; θ_P) and perturbed proxy log-probs P_pert = log p(· | x_{1:t}; θ_P + εg); then Δ̂ ← (P_pert − P_orig) / ε.
3. Combine and adjust scores: Scores(·) = log p(· | x_{1:t}; θ_T)/τ + λ Δ̂(· | x_{1:t}), where τ is the temperature and λ the penalty weight.
4. Sample the next token x_{t+1} ∼ Softmax(Scores(·)) and append it. Repeat for N tokens. Output: poisoned reasoning trace x_{1:N}.

Phase 3: Evaluation.
1. Generate traces using antidistillation sampling (varying λ) and baseline temperature sampling.
2. Distill a student model (e.g., Llama-3.2-3B) on the generated traces.
3. Measure performance: teacher accuracy and student accuracy (e.g., GSM8K, MATH).
4. Analyze the trade-off between teacher utility and student distillability (Fig. 1, Fig. 2).
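The per-token scoring step in Phase 2 can be sketched as a few array operations, assuming the three log-prob vectors (teacher, proxy, and perturbed proxy) are already computed for the current prefix. This is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def antidistill_scores(teacher_logp, proxy_logp, proxy_logp_pert,
                       eps=1e-3, tau=1.0, lam=0.5):
    """One scoring step of the loop above. proxy_logp_pert are the proxy
    log-probs evaluated at theta_P + eps*g; the finite difference below
    approximates the penalty term (Delta-hat)."""
    penalty = (proxy_logp_pert - proxy_logp) / eps   # finite-difference Delta-hat
    return teacher_logp / tau + lam * penalty        # adjusted sampling scores

def sample_next_token(scores, rng):
    """Softmax over adjusted scores, then sample one vocabulary index."""
    p = np.exp(scores - scores.max())  # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

With λ = 0 this degenerates to ordinary temperature sampling from the teacher, which is the baseline the paper compares against.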
Q1
1. What is the primary goal of antidistillation sampling?
To improve the accuracy of language models on reasoning tasks
To protect proprietary models by preventing effective distillation while maintaining model utility
To reduce the computational cost of training large language models
Q2
2. How does antidistillation sampling technically work?
By completely hiding token probabilities from model outputs
By adding random noise to the sampling distribution
By adding a penalty term based on the directional derivative of student model performance
Q3
3. In the GSM8K benchmark experiments, what was demonstrated about antidistillation sampling?
It improved both teacher and student model performance
It maintained teacher accuracy around 68% while reducing student accuracy to about 25% (compared to 52% with temperature sampling)
It completely eliminated the possibility of distillation but severely degraded teacher performance