2025-06-18 Papers

Paper 1

LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

Published: 2025-06-17

Link: http://arxiv.org/pdf/2506.14429

1. 📘 Topic and Domain: Analysis of long-context capabilities in diffusion-based Large Language Models (LLMs) in the field of Natural Language Processing.
2. 💡 Previous Research and New Ideas: Building on research into auto-regressive LLMs and RoPE scaling theory, offers novel insights into the distinctive long-context behavior of diffusion LLMs.
3. ❓ Problem: Addresses the unexplored area of how diffusion LLMs handle long context windows and whether they can be extended beyond their pretrained context lengths.
4. 🛠️ Methods: Conducts systematic comparison between diffusion and auto-regressive LLMs using Needle-In-A-Haystack tests, analyzes through RoPE theory, and proposes LongLLaDA method with NTK-based RoPE extrapolation.
5. 📊 Results and Evaluation: Achieves a 6x context extension (to 24k tokens) without further training; shows that diffusion LLMs maintain stable perplexity during extrapolation, match auto-regressive models on retrieval tasks, lag behind on aggregation tasks, and excel at QA tasks.
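
The NTK-based RoPE extrapolation mentioned in point 4 can be sketched in a few lines. This is a minimal illustration of NTK-aware base scaling, not the paper's implementation; the function names and default values (`base=10000.0`, `head_dim=128`) are assumptions.

```python
def ntk_scaled_base(scale, base=10000.0, head_dim=128):
    """NTK-aware RoPE scaling: raise the rotary base so the low-frequency
    dimensions stretch to cover `scale` times the original context window,
    enabling training-free context extension (e.g. scale=6 for a 6x window)."""
    return base * scale ** (head_dim / (head_dim - 2))

def rope_inv_freq(head_dim, base):
    """Per-dimension inverse frequencies used by rotary position embeddings."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

# With scale=1 the base is unchanged; larger scales slow every frequency
# band's rotation, so positions beyond the trained window stay within the
# angle range seen during pretraining.
extended_base = ntk_scaled_base(scale=6.0)
inv_freq = rope_inv_freq(128, extended_base)
```

Because only the rotary base changes, the same pretrained weights can be evaluated at the longer context with no gradient updates.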


Outline:
- Initial Analysis: Perplexity Evaluation, NIAH Tests, Local Perception Study
- Mechanistic Analysis: RoPE Theory, Position Embedding, t-SNE Visualization
- LongLLaDA Method: NTK-based RoPE, Scaling Laws, Context Extension
- Evaluation Results: Retrieval Tasks (matches auto-regressive), Aggregation Tasks (lags behind), QA Tasks (excels)
Q1. What unique characteristic did diffusion LLMs demonstrate during context length extrapolation compared to auto-regressive LLMs?
- Complete failure in all tasks
- Stable perplexity and local perception capabilities
- Exponential improvement in performance

Q2. How much context expansion did LongLLaDA achieve without additional training?
- 2x expansion (8k tokens)
- 4x expansion (16k tokens)
- 6x expansion (24k tokens)

Q3. In which type of task did diffusion LLMs consistently underperform compared to auto-regressive LLMs?
- Question Answering tasks
- Retrieval tasks
- Aggregation tasks

Paper 2

Reasoning with Exploration: An Entropy Perspective

Published: 2025-06-17

Link: http://arxiv.org/pdf/2506.14758

1. 📘 Topic and Domain: The paper focuses on improving language model reasoning capabilities through entropy-based reinforcement learning in the domain of natural language processing.
2. 💡 Previous Research and New Ideas: Building on traditional reinforcement learning exploration methods, the paper proposes a novel approach of using entropy as a signal to encourage exploratory reasoning behaviors in language models.
3. ❓ Problem: The paper addresses the issue of language models becoming overly exploitative during reinforcement learning training, leading to performance plateaus and limited reasoning capabilities.
4. 🛠️ Methods: The authors introduce a minimal modification to standard reinforcement learning, augmenting the advantage function with a clipped, gradient-detached entropy term that promotes longer reasoning chains while leaving the original policy-gradient optimization intact.
5. 📊 Results and Evaluation: The method achieved significant improvements on Pass@K metrics across multiple mathematical reasoning benchmarks, even with large K values, demonstrating enhanced reasoning capabilities compared to baseline models.
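
The entropy-shaped advantage described in point 4 can be sketched as follows. This is a hedged illustration, not the authors' code: the coefficient `alpha` and clipping threshold `clip` are hypothetical parameters, and the "gradient-detached" property is represented by treating the entropy bonus as a plain constant rather than a differentiable term.

```python
import math

def token_entropy(probs):
    """Shannon entropy (natural log) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def shaped_advantage(advantage, probs, alpha=0.1, clip=0.5):
    """Augment a per-token advantage with a clipped entropy bonus.
    In the paper's method the bonus is gradient-detached, so it biases
    which tokens get reinforced without contributing an
    entropy-maximizing gradient of its own; here it is just a constant."""
    bonus = min(alpha * token_entropy(probs), clip)
    return advantage + bonus

# High-entropy (exploratory) tokens receive a slightly larger advantage
# than confident ones, nudging the policy toward exploratory branches.
uniform = [0.25, 0.25, 0.25, 0.25]      # entropy = ln 4
peaked = [0.97, 0.01, 0.01, 0.01]       # near-deterministic, low entropy
```

The clipping keeps the bonus from dominating the task reward, and the entropy term naturally shrinks as the model grows confident, which tempers over-exploration.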


Outline:
- Preliminary Analysis: Pivotal Tokens, Reflective Actions, Rare Behaviors
- Method: Entropy-Based Advantage Shaping, Gradient Detachment
- Implementation: PPO Integration, GRPO Integration, One-line Code Change
- Results: Improved Pass@K Performance, Enhanced Exploratory Reasoning, Longer Reasoning Chains, Better Reasoning Boundaries
Q1. What key observation about entropy led to the paper's main innovation?
- High entropy regions correlated with debugging statements
- High entropy regions correlated with exploratory reasoning behaviors
- High entropy regions indicated model errors

Q2. How does the paper's method differ from traditional entropy-based reinforcement learning approaches?
- It removes entropy calculations completely
- It uses entropy to predict model accuracy
- It uses entropy to shape advantages while preserving original gradient flow

Q3. What unique feature helps prevent the paper's method from over-encouraging exploration?
- Manual tuning of exploration parameters
- A fixed decay schedule for entropy
- Natural tension between entropy and model confidence

Paper 3

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Published: 2025-06-17

Link: http://arxiv.org/pdf/2506.14245

1. 📘 Topic and Domain: The paper explores Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models (LLMs), focusing on improving reasoning capabilities.
2. 💡 Previous Research and New Ideas: Based on previous research showing RLVR-tuned models underperforming base models on Pass@K metrics, the paper proposes a new perspective that RLVR actually incentivizes correct reasoning rather than just finding correct answers.
3. ❓ Problem: The paper aims to resolve the apparent contradiction that RLVR-tuned models show worse Pass@K performance than base models despite supposedly improved reasoning capabilities.
4. 🛠️ Methods: The authors introduce a new metric called CoT-Pass@K that evaluates both reasoning path and final answer correctness, develop theoretical frameworks explaining RLVR's optimization process, and conduct empirical validation using LLM verifiers.
5. 📊 Results and Evaluation: Results show that RLVR consistently improves CoT-Pass@K across all K values, indicating genuine enhancement of reasoning capabilities, and analysis of training dynamics reveals this improvement emerges early in training and generalizes well.
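
The relation between Pass@K and the proposed CoT-Pass@K can be illustrated with the standard unbiased Pass@K estimator from the code-generation literature. As a sketch, this assumes CoT-Pass@K applies the same estimator but only counts samples whose reasoning chain and final answer are both correct; the variable names and sample counts are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimator: probability that at least one of k
    samples drawn from n total samples is correct, given c correct ones."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Suppose 10 samples: 6 reach the right answer, but only 4 of those
# also have a sound reasoning chain (numbers are illustrative).
n, k = 10, 5
answer_correct = 6          # counted by Pass@K
reasoning_and_answer = 4    # counted by the stricter CoT-Pass@K

pass_k = pass_at_k(n, answer_correct, k)
cot_pass_k = pass_at_k(n, reasoning_and_answer, k)
# CoT-Pass@K <= Pass@K, since lucky hits from flawed reasoning are excluded.
```

Under this stricter count, an RLVR-tuned model can improve CoT-Pass@K even where its plain Pass@K looks worse than the base model's.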


RLVR workflow for incentivizing correct reasoning:
- Base LLM → trained with Reinforcement Learning with Verifiable Rewards (RLVR)
- Correct CoTs vs. merely correct answers → measured by CoT-Pass@K vs. Pass@K
- Improved reasoning capabilities, verified through the CoT-Pass@K metric
Q1. What is the key limitation of the traditional Pass@K metric according to the paper?
- It only measures the speed of model responses
- It credits correct answers even when they come from flawed reasoning paths
- It can only evaluate simple mathematical problems

Q2. How does the paper's theoretical framework distinguish RLVR from traditional reinforcement learning?
- RLVR focuses on maximizing reward values only
- RLVR requires more computational resources
- RLVR emphasizes logical integrity of the entire reasoning path rather than just correct actions

Q3. What novel approach did the researchers use to verify the correctness of reasoning chains at scale?
- They relied solely on human experts
- They used DeepSeek-R1-0528-Qwen3-8B as an automated verifier with multiple verification attempts
- They only checked the final answers