2025-04-21 Papers

Paper 1

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Published: 2025-04-17

Link: http://arxiv.org/pdf/2504.13161

1. 📘 Topic and Domain: The paper introduces CLIMB, a framework for optimizing data mixtures for language model pre-training through clustering-based iterative bootstrapping.
2. 💡 Previous Research and New Ideas: The paper builds on previous data mixture approaches but proposes a novel method to automatically identify, evaluate, and refine data mixtures without relying on predefined domain labels.
3. ❓ Problem: The paper aims to solve the challenge of finding optimal pre-training data mixtures for language models when working with large-scale web datasets that lack inherent domain divisions.
4. 🛠️ Methods: The authors cluster documents in semantic space, then iteratively optimize mixture weights using a bootstrapping process with proxy models and predictors to progressively refine the data mixture.
5. 📊 Results and Evaluation: Using the optimal data mixture, their 1B model exceeded the state-of-the-art Llama-3.2-1B by 2.0% on reasoning tasks, and domain-specific mixture optimization yielded a 5% improvement over random sampling; the authors also released the ClimbLab and ClimbMix datasets.
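The clustering step in the methods above can be sketched with a toy, numpy-only k-means followed by size-based pruning. This is an illustrative stand-in, not the paper's implementation: the embedding model, cluster count, and the "low-quality" criterion (here just cluster size) are all assumptions.

```python
import numpy as np

def cluster_and_prune(embeddings, k_init=8, min_size=5, n_iters=20, seed=0):
    """Toy sketch of CLIMB's Phase 1: k-means in embedding space, then
    prune clusters that are too small (a stand-in for quality pruning)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen data points
    centroids = embeddings[rng.choice(len(embeddings), size=k_init,
                                      replace=False)].copy()
    for _ in range(n_iters):
        # assign each document embedding to its nearest centroid
        d = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :],
                           axis=-1)
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster empties
        for c in range(k_init):
            members = embeddings[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # "prune low-quality": here, drop clusters with too few members
    keep = [c for c in range(k_init) if (labels == c).sum() >= min_size]
    return labels, keep
```

The real pipeline would also merge highly similar clusters; that step is omitted here for brevity.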

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

CLIMB Methodology Flowchart

Phase 1: Data Preprocessing & Clustering
1. Embed texts: large raw dataset (D̂) -> embedding vectors (E).
2. Cluster embeddings: k-means on E -> initial clusters (K_init).
3. Refine clusters: prune low-quality clusters (K_pruned) and merge similar ones (K_enhanced) -> final set of data clusters (D).

Phase 2: Iterative Mixture Bootstrapping (K iterations, k = 1 to K)
1. Sample/select mixtures (α): random for k = 1; guided by predictor f_{k-1} for k > 1.
2. Train proxy models on the sampled mixtures -> get performance ℓ(α, ω*).
3. Update the evaluated set S_k: combine the previous S_{k-1} with the new (α, performance) pairs.
4. Train/update the predictor f_k (e.g., LightGBM regression) on all data in S_k; the predictor guides the next iteration's sampling.

After K iterations, use the final predictor f_K to identify the optimal mixture (α*).
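The iterative bootstrapping phase can be sketched as a search loop: sample mixtures, score them with a black-box proxy, fit a predictor, and let the predictor steer the next round's sampling. Everything here is illustrative: `proxy_loss` stands in for training a small proxy model on a mixture and measuring its loss, and the paper's LightGBM regressor is replaced by least-squares on quadratic features.

```python
import numpy as np

def climb_bootstrap(proxy_loss, n_clusters=4, n_rounds=3, n_per_round=16,
                    seed=0):
    """Minimal sketch of CLIMB's Phase 2 mixture bootstrapping."""
    rng = np.random.default_rng(seed)

    def feats(a):
        # quadratic features so the predictor can model cluster interactions
        return np.concatenate([a, np.outer(a, a)[np.triu_indices(len(a))]])

    S_alpha, S_loss = [], []
    for k in range(n_rounds):
        if k == 0:
            # first round: random mixtures from a Dirichlet prior
            cands = rng.dirichlet(np.ones(n_clusters), size=n_per_round)
        else:
            # later rounds: over-sample, keep the predictor's favorites
            pool = rng.dirichlet(np.ones(n_clusters), size=20 * n_per_round)
            pred = np.stack([feats(a) for a in pool]) @ w
            cands = pool[np.argsort(pred)[:n_per_round]]
        for a in cands:
            S_alpha.append(a)
            S_loss.append(proxy_loss(a))  # "train proxy model" stand-in
        # refit the predictor on all evaluated (mixture, loss) pairs
        X = np.stack([feats(a) for a in S_alpha])
        w, *_ = np.linalg.lstsq(X, np.array(S_loss), rcond=None)
    # report the best mixture actually evaluated
    return S_alpha[int(np.argmin(S_loss))]
```

Each round narrows the search toward mixtures the predictor expects to perform well, mirroring the flowchart's "predictor guides next iteration's sampling" arrow.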
Q1
1. What is the main innovation of CLIMB compared to previous data mixture methods?
It uses larger proxy models to evaluate data quality
It automatically identifies and optimizes data mixtures without relying on predefined domain labels
It focuses exclusively on reasoning tasks rather than general capabilities
Q2
2. In the CLIMB framework, what is the purpose of the iterative bootstrapping process?
To train increasingly larger language models at each iteration
To gradually filter out low-quality web content
To progressively refine the search space and eliminate suboptimal data mixture candidates
Q3
3. What was a key finding from the ablation studies on CLIMB?
Using a 62M proxy model performed better than using a 350M proxy model
More search iterations improved performance, but compute should be balanced between depth and breadth
Random initialization consistently outperformed Dirichlet initialization

Paper 2

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Published: 2025-04-18

Link: http://arxiv.org/pdf/2504.13837

1. 📘 Topic and Domain: The paper examines whether reinforcement learning (RL) actually creates new reasoning capabilities in large language models (LLMs) beyond what exists in base models, focusing on mathematical, programming, and visual reasoning tasks.
2. 💡 Previous Research and New Ideas: The paper builds on previous research in Reinforcement Learning with Verifiable Rewards (RLVR) but challenges the common belief that RLVR enables LLMs to develop novel reasoning abilities beyond their base models.
3. ❓ Problem: The paper aims to determine whether RLVR training genuinely introduces new reasoning capabilities to LLMs or merely optimizes existing capabilities from the base model.
4. 🛠️ Methods: The authors used the pass@k metric with large k values across multiple model families and benchmarks to measure the reasoning capability boundaries of both base and RL-trained models, combined with perplexity analysis.
5. 📊 Results and Evaluation: The results showed that while RL-trained models outperform base models at small k values, base models achieve higher pass@k scores at large k values, indicating that RLVR improves sampling efficiency but does not introduce new reasoning abilities beyond what already exists in the base models.

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Paper Workflow: Re-examining RLVR's Impact on Reasoning

Core question: does RLVR truly add NEW reasoning capabilities beyond the base model?
Challenge: traditional metrics (e.g., pass@1) show average performance, not capability limits.
Proposed method: evaluate the reasoning boundary using pass@k with LARGE k. Rationale: if the base model solves a problem given enough samples (large k), the capability already exists, just sampled less efficiently.

Experimental setup: compare base vs. RLVR models.
- Models: Qwen-2.5 (7B, 14B, 32B), LLaMA-3.1-8B, Qwen-2.5-VL-7B.
- Tasks & benchmarks: math (GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23); code (LiveCodeBench, HumanEval+, MBPP+); visual reasoning (MathVista and MathVision, both filtered).
- RL approach: primarily "zero RL" (RL directly on the base model); algorithms: GRPO (main), PPO and others in later analysis.
- Evaluation: zero-shot prompts, T = 0.6, top-p = 0.95, large k (e.g., 256, 1024).
- Key measurement: plot pass@k curves for base vs. RLVR models.

Deep analysis to understand the observed pass@k trends:
1. CoT validity check (math/visual). Problem: could a correct answer come from wrong reasoning? Method: filter guessable problems (AIME24); manually inspect CoTs for the hardest problems solved at large k.
2. Coverage & perplexity analysis. Method 1 (solvable-set comparison): check whether {problems solved by RL} ⊆ {problems solved by Base} at large k. Method 2 (perplexity): compare PPL_Base(Y_RL) vs. PPL_Base(Y_Base).
3. Comparison with distillation. Question: does distillation behave differently? Method: compare pass@k curves of the base, RL, and distilled models (e.g., DeepSeek-R1 distilled into Qwen).
4. RL algorithm & training-step analysis. Method 1 (algorithm comparison): use the VeRL framework for fair comparison (PPO, GRPO, RLOO, ...); define the Sampling Efficiency Gap ΔSE = pass@k(Base) - pass@1(RL); evaluate on Omni-MATH splits. Method 2 (training steps): track pass@1 and pass@k (large k) versus training steps.

Connecting the analyses to the research question:
- Does large-k base performance match or exceed the RL model's pass@k?
- Are RL solutions already likely under the base model (perplexity)?
- Is the set of RL-solvable problems a subset of base-solvable ones (coverage)?
- Does distillation show a different boundary expansion (distillation)?
- How close do RL algorithms get to the base model's boundary (ΔSE)?
- Does longer RL training shrink the boundary (training steps)?
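The solvable-set comparison in the coverage analysis amounts to a subset check over per-problem outcomes. A trivial sketch (problem IDs are hypothetical placeholders):

```python
def coverage_check(base_solved, rl_solved):
    """Solvable-set comparison: at large k, is every problem the RLVR model
    solves also solved by the base model? Returns the subset flag plus any
    counterexamples (problems only the RL model solves)."""
    extra = set(rl_solved) - set(base_solved)
    return len(extra) == 0, extra
```

An empty counterexample set supports the paper's claim that RLVR stays within the base model's reasoning boundary.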
Q1
1. According to the paper, what is the primary effect of Reinforcement Learning with Verifiable Rewards (RLVR) on LLMs?
It creates entirely new reasoning capabilities beyond what exists in the base model
It improves sampling efficiency by biasing the model toward rewarded reasoning paths
It increases the model's ability to explore novel reasoning patterns
Q2
2. What surprising phenomenon did the researchers observe when comparing base models to RL-trained models at large k values?
Base models consistently outperformed their RL-trained counterparts
Both models performed equally well regardless of k value
RL-trained models showed exponential improvement as k increased
Q3
3. How does the paper distinguish between the effects of RLVR and distillation on LLM reasoning capabilities?
RLVR and distillation both reduce the model's reasoning boundary in similar ways
RLVR is bounded by the base model's capabilities, while distillation can genuinely introduce new knowledge
Distillation improves sampling efficiency while RLVR expands reasoning patterns

Paper 3

Antidistillation Sampling

Published: 2025-04-17

Link: http://arxiv.org/pdf/2504.13146

1. 📘 Topic and Domain: The paper introduces "Antidistillation Sampling," a technique in AI security that prevents language models from being effectively distilled while maintaining their functionality.
2. 💡 Previous Research and New Ideas: The paper builds on prior work in model distillation and data poisoning, proposing a novel approach to strategically modify a model's token probability distributions to resist distillation.
3. ❓ Problem: The paper aims to solve the problem of protecting proprietary large language models from being easily distilled by competitors who could use the models' reasoning traces to train their own systems at much lower cost.
4. 🛠️ Methods: The authors use a gradient-based approach that modifies the sampling distribution by adding a penalty term based on a directional derivative capturing how token choices would impact a distilled model's performance, implemented efficiently using a finite-difference approximation.
5. 📊 Results and Evaluation: Results show that antidistillation sampling sharply reduces student model performance (24.73% on GSM8K, vs. 51.86% when distilling from temperature-sampled traces) while maintaining comparable teacher accuracy (68.51% vs. 68.90%), demonstrating effective protection against distillation attempts.

Antidistillation Sampling

Antidistillation Sampling Workflow (Methodology Flowchart)

Goal: modify sampling to achieve (1) non-distillability (poison student training) and (2) nominal utility (maintain teacher performance).

Phase 1: Initialization (once). Define the models: teacher (θ_T) and proxy (θ_P). Define a downstream loss ℓ (e.g., NLL on a benchmark). Compute the loss gradient g ← ∇ℓ(θ_P); store g and θ_P + εg.

Phase 2: Token generation loop (for each token t; input: current sequence x_{1:t}).
1. Get teacher log-probs: log p(· | x_{1:t}; θ_T).
2. Compute the approximate penalty Δ̂: get proxy log-probs P_orig = log p(· | x_{1:t}; θ_P) and perturbed proxy log-probs P_pert = log p(· | x_{1:t}; θ_P + εg); then Δ̂ ← (P_pert − P_orig) / ε.
3. Combine and adjust scores: Scores(·) = log p(· | x_{1:t}; θ_T)/τ + λ Δ̂(· | x_{1:t}), where τ is the temperature and λ the penalty weight.
4. Sample the next token x_{t+1} ∼ Softmax(Scores(·)) and append it. Repeat for N tokens. Output: poisoned reasoning trace x_{1:N}.

Phase 3: Evaluation.
1. Generate traces using antidistillation sampling (varying λ) and baseline temperature sampling.
2. Distill a student model (e.g., Llama-3.2-3B) on the generated traces.
3. Measure performance: teacher accuracy and student accuracy (e.g., GSM8K, MATH).
4. Analyze the trade-off between teacher utility and student distillability (Fig. 1, Fig. 2).
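The per-token scoring step in Phase 2 can be sketched as a few array operations, assuming the three log-prob vectors (teacher, proxy, and perturbed proxy) are already computed for the current prefix. This is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def antidistill_scores(teacher_logp, proxy_logp, proxy_logp_pert,
                       eps=1e-3, tau=1.0, lam=0.5):
    """One scoring step of the loop above. proxy_logp_pert are the proxy
    log-probs evaluated at theta_P + eps*g; the finite difference below
    approximates the penalty term (Delta-hat)."""
    penalty = (proxy_logp_pert - proxy_logp) / eps   # finite-difference Delta-hat
    return teacher_logp / tau + lam * penalty        # adjusted sampling scores

def sample_next_token(scores, rng):
    """Softmax over adjusted scores, then sample one vocabulary index."""
    p = np.exp(scores - scores.max())  # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

With λ = 0 this degenerates to ordinary temperature sampling from the teacher, which is the baseline the paper compares against.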
Q1
1. What is the primary goal of antidistillation sampling?
To improve the accuracy of language models on reasoning tasks
To protect proprietary models by preventing effective distillation while maintaining model utility
To reduce the computational cost of training large language models
Q2
2. How does antidistillation sampling technically work?
By completely hiding token probabilities from model outputs
By adding random noise to the sampling distribution
By adding a penalty term based on the directional derivative of student model performance
Q3
3. In the GSM8K benchmark experiments, what was demonstrated about antidistillation sampling?
It improved both teacher and student model performance
It maintained teacher accuracy around 68% while reducing student accuracy to about 25% (compared to 52% with temperature sampling)
It completely eliminated the possibility of distillation but severely degraded teacher performance