2025-07-28 Papers


Paper 1

The Invisible Leash: Why RLVR May Not Escape Its Origin

Published: 2025-07-20

Link: http://arxiv.org/pdf/2507.14843

1. 📘 Topic and Domain: The paper examines the limitations of Reinforcement Learning with Verifiable Rewards (RLVR) in large language models, specifically focusing on reasoning capabilities and model behavior.
2. 💡 Previous Research and New Ideas: Building on recent advances in large reasoning models trained with RLVR, the paper proposes a theoretical framework showing that RLVR is constrained by the base model's support and operates as a conservative reweighting mechanism.
3. ❓ Problem: The paper investigates whether RLVR truly expands a model's reasoning capabilities or merely amplifies existing high-reward outputs from the base model.
4. 🛠️ Methods: The authors conduct theoretical analysis and empirical experiments across various reasoning tasks, examining empirical support dynamics, entropy metrics, and performance on mathematical and non-mathematical reasoning benchmarks.
5. 📊 Results and Evaluation: Results show that while RLVR improves pass@1 accuracy, it tends to shrink rather than expand the model's empirical support, with entropy reduction leading to narrower solution spaces and potentially missing valid solutions accessible to the base model.
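The support argument in points 2-5 can be illustrated with a toy sketch (all numbers here are hypothetical, and this is a simplification, not the paper's training procedure): RLVR-style reweighting scales base probabilities by exponentiated reward, so any answer with zero probability under the base model stays at zero, and concentrating mass on rewarded answers shrinks answer-level entropy in this example.

```python
import math

def reweight(base_probs, rewards, beta=2.0):
    """Conservative reweighting of a discrete base distribution.

    Each answer's probability is scaled by exp(beta * reward) and
    renormalized. Answers outside the base support (p == 0) remain at 0,
    so supp(pi_theta) stays a subset of supp(q).
    """
    weights = [p * math.exp(beta * r) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical base model over 4 candidate answers; answer 3 is
# outside the base support (probability 0).
q = [0.5, 0.3, 0.2, 0.0]
r = [1, 0, 1, 1]          # verifiable rewards in {0, 1}

pi = reweight(q, r)
print(pi[3])              # still 0.0: reward alone cannot create support
print(entropy(pi) < entropy(q))   # True here: mass concentrates
```

The invisible leash is visible in the last two lines: no amount of reward signal moves probability onto an answer the base model never samples.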

Figure summary: RLVR workflow analysis

• Pipeline: base model q(y|x) → RLVR training with verifiable rewards R(x, y) ∈ {0, 1} → RLVR model π_θ(y|x)
• Theoretical analysis: support preservation (supp(π_θ) ⊆ supp(q)), conservative updates, entropy-reward tradeoff (H[π_θ] ≤ H[q])
• Empirical analysis: support preservation, shrinkage vs. expansion, entropy dynamics
• Key findings: improved pass@1 accuracy but lower pass@k; shrinkage exceeds expansion; answer-level entropy declines while token-level entropy varies; precision up, diversity down; a conservative reweighting mechanism overall
• Experimental setup: math and general reasoning tasks; ProRL vs. base model; high-k sampling with k ∈ {1024, 2048, 4096, 8192, 16384}
• Conclusions: RLVR acts as a conservative reweighting mechanism within the base model's support; breaking the "invisible leash" requires explicit exploration mechanisms, e.g., hybrid strategies
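The pass@k figures behind the high-k sampling setup (k up to 16384) are conventionally computed with the standard unbiased estimator over n sampled completions; a minimal sketch, assuming n samples of which c are correct (the example numbers are hypothetical, not results from the paper):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k completions drawn without replacement from n samples is correct,
    given that c of the n samples are correct."""
    if n - c < k:        # fewer than k incorrect samples: always a hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical run: 16384 samples per problem, 40 of them correct
for k in (1, 1024, 16384):
    print(k, pass_at_k(16384, 40, k))
```

Note how the same (n, c) pair yields very different numbers as k grows, which is why pass@1 can improve even while pass@k at large k degrades.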
Q1. What is the main trade-off identified in RLVR according to the paper?
a) Speed versus accuracy
b) Precision versus exploration diversity
c) Model size versus performance

Q2. When comparing token-level entropy and answer-level entropy in RLVR models, what interesting phenomenon was observed?
a) Both types of entropy always decreased together
b) Token-level entropy sometimes increased while answer-level entropy consistently declined
c) Both types of entropy remained constant throughout training

Q3. According to the paper's theoretical framework, why can't RLVR discover completely new solutions?
a) Because the model is too small to generate new solutions
b) Because it cannot sample solutions with zero initial probability from the base model
c) Because the training data is insufficient

Paper 2

Pixels, Patterns, but No Poetry: To See The World like Humans

Published: 2025-07-21

Link: http://arxiv.org/pdf/2507.16863

1. 📘 Topic and Domain: The paper focuses on evaluating and testing the visual perception capabilities of Multimodal Large Language Models (MLLMs) through a new benchmark called Turing Eye Test (TET).
2. 💡 Previous Research and New Ideas: Where previous research focused on the reasoning capabilities of MLLMs, this paper shifts attention to fundamental visual perception, probing it directly through specialized perceptual tasks.
3. ❓ Problem: The paper addresses whether MLLMs can truly perceive visual information like humans do, revealing a fundamental gap between machine and human perception capabilities.
4. 🛠️ Methods: The authors created four diagnostic tasks (HiddenText, 3DCaptcha, ColorBlind, and ChineseLigatures) and evaluated 15 state-of-the-art MLLMs using Pass@1 and Pass@K metrics, along with analyzing model behavior through Grad-CAM visualization.
5. 📊 Results and Evaluation: Results showed catastrophic failures of current MLLMs on these perceptual tasks, with most models achieving near-zero success rates, while fine-tuning the vision tower enabled rapid adaptation, suggesting the limitation lies in visual perception rather than reasoning capabilities.

Figure summary: Turing Eye Test (TET) methodology flow

• Dataset creation: four specialized tasks, all synthetic visual challenges: HiddenText (150 images), 3DCaptcha (150 images), ColorBlind (150 images), ChineseLigatures (40 phrases)
• Model evaluation: 15 state-of-the-art MLLMs spanning unified models (Show-o2, Bagel), API models (Claude, Gemini, o1), and open-source models (Qwen, InternVL); metrics Pass@1 and Pass@32; temperature 0.3, max tokens 16384
• Results: catastrophic failures, with most models at 0% Pass@1, peak improvement below 4%, and minimal Pass@K variance; visual perception identified as the bottleneck
• Grad-CAM analysis: attention visualization over the vision encoder (ViT) and language backbone (LLM) shows models failing to locate target regions, with attention scattered incorrectly
• Supervised fine-tuning: five configurations (full parameters; vision encoder only; vision + adapter; language backbone only; adapter only); tuning the vision encoder proved essential
• In-context learning: 3-example same-domain demonstrations yielded virtually no improvement (knowledge ≠ perception)
• Image processing: downsampling and blurring analysis; downsampling helps, pointing to a vision-patch limitation
• Key insights: generalization of the vision tower, not language reasoning, is the bottleneck; fine-tuning the vision encoder enables rapid adaptation to perceptual tasks; current MLLMs lack human-like visual perception
• Conclusion: TET reveals fundamental visual perception limitations in current MLLMs; future work targets enhanced visual generalization methods and the full TET benchmark
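The finding that downsampling helps on HiddenText can be illustrated with a minimal block-averaging sketch (the image and the `downsample` helper are hypothetical illustrations, not the paper's preprocessing code): averaging s×s pixel blocks merges fine-grained strokes into coarser patterns that fit within fewer vision patches.

```python
def downsample(image, s):
    """Downsample a grayscale image (list of equal-length rows) by
    averaging non-overlapping s x s blocks."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - h % s, s):
        row = []
        for j in range(0, w - w % s, s):
            block = [image[i + di][j + dj] for di in range(s) for dj in range(s)]
            row.append(sum(block) / (s * s))
        out.append(row)
    return out

# Toy 4x4 image with a fine checkerboard texture (values 0 or 255)
img = [[255 * ((i + j) % 2) for j in range(4)] for i in range(4)]
small = downsample(img, 2)
print(small)   # 2x2 image; each block averages the fine texture away
```

The high-frequency checkerboard collapses to a uniform gray, a crude analogue of how downsampling simplifies hidden-text patterns for the vision tower.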
Q1. What is the main insight revealed by the fine-tuning experiments in the paper?
a) MLLMs lack sufficient training data for visual tasks
b) The limitation lies in the vision tower's perception capabilities rather than reasoning
c) The language backbone needs more parameters to improve performance

Q2. In the HiddenText experiment, what happened when images were downsampled?
a) Performance got worse due to loss of detail
b) Performance improved as it simplified the character patterns
c) There was no significant change in performance

Q3. What unique aspect differentiates TET from previous MLLM benchmarks?
a) It focuses on testing visual perception rather than reasoning capabilities
b) It uses a larger dataset than previous benchmarks
c) It only tests Chinese language understanding

Paper 3

∇NABLA: Neighborhood Adaptive Block-Level Attention

Published: 2025-07-17

Link: http://arxiv.org/pdf/2507.13546

1. 📘 Topic and Domain: Video generation using transformer models, specifically focusing on optimizing attention mechanisms in video diffusion transformers.
2. 💡 Previous Research and New Ideas: Building on previous work in sparse attention mechanisms and Sliding Tile Attention (STA), the paper proposes NABLA, an adaptive approach that dynamically determines attention patterns rather than using fixed ones.
3. ❓ Problem: Addresses the quadratic computational complexity of full attention mechanisms in video generation transformers, which becomes a bottleneck for high-resolution and long-duration videos.
4. 🛠️ Methods: Implements a Neighborhood Adaptive Block-Level Attention mechanism that uses downsampling and thresholding to dynamically select important attention blocks, combined with STA for optimal performance.
5. 📊 Results and Evaluation: Achieved 2.7× faster training and inference compared to baseline models while maintaining equivalent quality metrics (CLIP score, VBench score, human evaluation), with successful validation through both objective metrics and human evaluation studies.
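The bottleneck in point 3 is easy to quantify: attention cost scales with the square of the token count S = T·H·W, so at 90% sparsity only a tenth of the query-key pairs are computed. A back-of-the-envelope sketch (the grid dimensions and sparsity level are hypothetical, not figures from the paper):

```python
def attention_pairs(t, h, w):
    """Number of query-key pairs in full self-attention over a
    T x H x W video token grid."""
    s = t * h * w
    return s * s

full = attention_pairs(32, 60, 104)   # hypothetical latent video grid
sparse = full * (1 - 0.90)            # keep 10% of pairs at 90% sparsity
print(f"full: {full:.3e}  sparse: {sparse:.3e}  ratio: {full / sparse:.0f}x")
```

Doubling spatial or temporal resolution quadruples the pair count, which is why adaptive sparsity matters most for high-resolution, long-duration video.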

Figure summary: ∇NABLA workflow

• Input: video tokens of shape T×H×W×D; token reordering via fractal flattening with patch size P×P; projections Q = XW_Q, K = XW_K, V = XW_V
• NABLA mask computation: (1) block averaging to Q_a, K_a ∈ R^(h×S/N×D); (2) reduced attention A = softmax(Q_a K_a^T / √D); (3) CDF computation: vals, order = sort(A); (4) binarization: M = cumsum(vals) ≥ 1 − thr; (5) reorder M to obtain the sparse mask M_∇
• STA mask: Sliding Tile Attention with window (W_T, W_H, W_W) yields M_STA
• Mask combination: M = M_∇ ∨ M_STA (logical OR)
• Sparse attention: FlexAttention(Q, K, V, M), implemented in PyTorch
• Performance: 2.7× speedup; 80-92% sparsity; quality maintained (CLIP/VBench); training: 1.46× speedup
• Key features: dynamic threshold selection via CDF; hardware-agnostic FlexAttention integration; adaptive sparsity without custom CUDA kernels; complementary with STA for optimal quality
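The mask-computation steps above can be sketched in plain Python on a toy block-level score matrix (assumptions: tiny dimensions, a single head, and a single global CDF threshold over the whole map; the paper's version operates on block-averaged Q/K inside FlexAttention, so treat this purely as an illustration of steps 3-5):

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def nabla_mask(scores, thr=0.8):
    """Binarize a reduced (block-level) attention map: sort block
    probabilities, take the cumulative sum, and keep the high-mass
    blocks that together cover at least `thr` of the attention mass."""
    probs = [p for row in scores for p in softmax(row)]
    order = sorted(range(len(probs)), key=lambda i: probs[i])  # ascending
    total = sum(probs)
    mask = [False] * len(probs)
    csum = 0.0
    for i in order:
        csum += probs[i]
        # blocks past the low-mass tail (cumsum >= (1 - thr) * total) survive
        if csum >= (1 - thr) * total:
            mask[i] = True
    return mask

def combine(m_nabla, m_sta):
    """M = M_nabla OR M_sta: union of adaptive and sliding-tile masks."""
    return [a or b for a, b in zip(m_nabla, m_sta)]

# Toy 2x4 block-level score matrix (hypothetical values)
scores = [[4.0, 0.1, 0.0, 0.2],
          [0.0, 3.0, 3.0, 0.1]]
m = nabla_mask(scores, thr=0.8)
sta = [True, False, False, False,   # e.g. a local sliding-window mask
       False, True, False, False]
print(combine(m, sta))
```

Dropping only the low-probability tail is what makes the sparsity adaptive: uniform rows keep many blocks, peaked rows keep few, with no custom CUDA kernel required.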
Q1. What is the main innovation of NABLA compared to previous sparse attention approaches?
a) It uses custom CUDA kernels for faster computation
b) It dynamically adapts attention patterns based on content
c) It reduces the input resolution to save memory

Q2. What speed improvement did NABLA achieve while maintaining quality metrics?
a) 1.5× faster than baseline
b) 2.7× faster than baseline
c) 4× faster than baseline

Q3. When combining NABLA with STA (Sliding Tile Attention), what was the main benefit?
a) It reduced computational costs by 95%
b) It improved the visual quality metrics significantly
c) It helped mitigate boundary artifacts while maintaining efficiency