2025-05-26 Papers


Paper 1

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Published: 2025-05-23

Link: http://arxiv.org/pdf/2505.18129

1. 📘 Topic and Domain: The paper presents V-Triune, a unified reinforcement learning system for vision-language models that combines both visual reasoning and perception tasks.
2. 💡 Previous Research and New Ideas: Prior research addressed reasoning tasks (math, science) and perception tasks (detection, grounding) separately; this paper proposes a unified approach that trains both through a triple-component system and a dynamic IoU reward mechanism.
3. ❓ Problem: The paper addresses the challenge of training vision-language models to perform both reasoning and perception tasks effectively within a single unified framework, as previous approaches treated these tasks in isolation.
4. 🛠️ Methods: The paper implements a three-tier system: Sample-Level Data Formatting (for unified task inputs), Verifier-Level Reward Computation (for custom rewards), and Source-Level Metric Monitoring (for diagnostics), along with a Dynamic IoU reward for perception tasks.
5. 📊 Results and Evaluation: The resulting Orsta models achieved significant improvements on the MEGA-Bench Core benchmark, with gains ranging from +2.1% to +14.1% across model variants (7B and 32B), while also showing strong performance on downstream tasks such as MMMU, MathVista, and COCO.
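The Dynamic IoU reward in the methods above can be sketched as a binary reward whose IoU threshold tightens as training progresses (relaxed early, strict late). The linear schedule and the threshold values below are illustrative assumptions, not the paper's exact settings:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, step, total_steps,
                       start_thresh=0.5, end_thresh=0.95):
    """Binary perception reward whose IoU threshold is interpolated
    from a relaxed to a strict criterion over training (illustrative)."""
    progress = step / total_steps
    thresh = start_thresh + (end_thresh - start_thresh) * progress
    return 1.0 if iou(pred_box, gt_box) >= thresh else 0.0
```

Early in training a rough box earns reward; the same box later fails the stricter criterion, which matches the "progressive perception feedback" idea.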


[Figure] V-Triune system overview: visual reasoning and perception task inputs flow through three components, Sample-Level Data Formatting (unifies diverse task inputs), Verifier-Level Reward Computation (custom rewards via verifiers), and Source-Level Metric Monitoring (data-source diagnostics); a Dynamic IoU reward mechanism provides adaptive perception feedback, yielding the Orsta model.
Key features:
• Unified training for visual reasoning and perception
• Modular reward computation system
• Progressive perception feedback
• Comprehensive metric monitoring
Q1
1. What is the main innovation of V-Triune compared to previous vision-language reinforcement learning approaches?
It uses a larger model architecture with more parameters
It unifies both reasoning and perception tasks in a single training framework
It focuses exclusively on improving visual perception tasks
Q2
2. Why did the researchers decide to freeze the ViT (Vision Transformer) parameters during training?
To save computational resources and training time
Because ViT was already perfectly trained for all tasks
Because joint training led to gradient explosion and performance collapse
Q3
3. What unique feature does the Dynamic IoU reward mechanism introduce?
It progressively adjusts the threshold from relaxed to stricter criteria during training
It randomly varies the reward threshold to prevent overfitting
It maintains a fixed high threshold throughout training

Paper 2

QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

Published: 2025-05-23

Link: http://arxiv.org/pdf/2505.17667

1. 📘 Topic and Domain: The paper focuses on developing long-context large reasoning models through reinforcement learning, specifically in the domain of natural language processing and artificial intelligence.
2. 💡 Previous Research and New Ideas: The paper builds on recent large reasoning models (LRMs) that demonstrate strong reasoning capabilities through RL in short-context tasks, and proposes a novel framework called QwenLong-L1 to extend these capabilities to long-context scenarios.
3. ❓ Problem: The paper addresses the challenge of extending large reasoning models to effectively process and reason on long-context inputs (e.g., 120K tokens) via reinforcement learning, tackling issues of suboptimal training efficiency and unstable optimization.
4. 🛠️ Methods: The paper implements a progressive context scaling framework combining warm-up supervised fine-tuning, curriculum-guided phased reinforcement learning, and a difficulty-aware retrospective sampling strategy.
5. 📊 Results and Evaluation: QwenLong-L1-32B outperformed flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B across seven long-context document question-answering benchmarks, achieving performance comparable to Claude-3.7-Sonnet-Thinking.
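QwenLong-L1's RL stage uses Group Relative Policy Optimization (GRPO), which scores each sampled response against the other responses in its group instead of a learned value function. A minimal sketch of the group-relative advantage, using the commonly published normalization form (assumed here, not taken from the paper):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each sampled response's reward is
    normalized by the mean and std of its own group's rewards."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Responses above the group mean get positive advantage, those below get negative, so no separate critic network is needed.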


[Figure] QwenLong-L1 training workflow: base model → warm-up SFT (initial policy training) → curriculum RL (progressive context scaling) → difficulty-aware retrospective sampling → QwenLong-L1 model.
Training components:
• Group Relative Policy Optimization (GRPO)
• Hybrid reward mechanisms (rule-based + LLM-as-judge)
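The hybrid reward mechanism among the training components can be sketched as combining a rule-based exact-match check with an LLM-judge verdict; the combination rule below (fall back to the judge only when the rule fails) is an assumption for illustration, and `llm_judge` is a hypothetical callable:

```python
def rule_reward(pred: str, gold: str) -> float:
    """Rule-based check: exact match after simple normalization."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def hybrid_reward(pred: str, gold: str, llm_judge) -> float:
    """Hybrid reward: cheap rule-based check first; if it fails,
    defer to an LLM judge (hypothetical callable returning 0.0/1.0)
    so correct-but-rephrased answers can still score."""
    if rule_reward(pred, gold) == 1.0:
        return 1.0
    return float(llm_judge(pred, gold))
```

Calling the judge only on rule failures keeps the expensive LLM evaluation off the easy cases.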
Q1
1. What is the main challenge that QwenLong-L1 aims to address in long-context reasoning?
Slow training speed and high computational costs
Suboptimal training efficiency and unstable optimization process
Limited memory capacity and model size constraints
Q2
2. Which component is NOT part of QwenLong-L1's progressive context scaling framework?
Warm-up supervised fine-tuning
Difficulty-aware retrospective sampling
Automated parameter pruning
Q3
3. When testing QwenLong-L1-14B with increased sampling scales, what interesting finding was observed?
It performed worse than smaller models
It surpassed DeepSeek-R1 even with a small sampling number
It required massive computational resources

Paper 3

QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization

Published: 2025-05-23

Link: http://arxiv.org/pdf/2505.18092

1. 📘 Topic and Domain: The paper presents QwenLong-CPRS, a context compression framework for large language models (LLMs) in the domain of natural language processing.
2. 💡 Previous Research and New Ideas: The work builds upon previous research in RAG frameworks and sparse attention mechanisms, proposing a novel dynamic context optimization paradigm that uses natural language instructions to guide multi-granularity context compression.
3. ❓ Problem: The paper addresses two key challenges: the prohibitive computational overhead during long sequence processing and the "lost in the middle" performance degradation where LLMs struggle to effectively handle lengthy inputs.
4. 🛠️ Methods: The authors implement four key innovations: natural language-guided dynamic optimization, bidirectional reasoning layers for boundary awareness, token critic mechanisms with language modeling heads, and window-parallel inference architecture.
5. 📊 Results and Evaluation: Across five benchmarks (4K-2M word contexts), QwenLong-CPRS achieved 21.59× context compression with 19.15-point average performance gains, surpassing leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench benchmarks.
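The token-critic idea in the methods can be sketched as scoring pieces of the context for query relevance and keeping only the highest-scoring ones in original order. The scoring function below is a naive word-overlap stand-in for the paper's learned language-modeling-head critic, and the sentence granularity is an assumption:

```python
def compress_context(query: str, sentences: list[str], keep_ratio: float = 0.25):
    """Keep the sentences most relevant to the query, preserving their
    original order. Relevance is naive word overlap -- a stand-in for
    a learned token critic."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(s.lower().split())), i, s)
              for i, s in enumerate(sentences)]
    k = max(1, int(len(sentences) * keep_ratio))
    top = sorted(scored, key=lambda t: -t[0])[:k]       # highest scores
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]  # restore order
```

With `keep_ratio=0.25` this yields roughly 4× compression; the paper's reported 21.59× comes from its learned critic operating at token granularity, not from this heuristic.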


[Figure] QwenLong-CPRS framework: input (system prompt + user query + long context) passes through causal language modeling layers and bi-directional reasoning layers, with a language modeling head acting as token critic; window-parallel inference then produces the dynamically optimized context.
Q1
1. What is the main innovation of QwenLong-CPRS compared to previous approaches like RAG and sparse attention?
It uses pre-trained language models for compression
It enables natural language-guided dynamic context optimization
It increases the context window size to 2M tokens
Q2
2. What level of context compression did QwenLong-CPRS achieve while maintaining performance?
5.5× compression with 10-point performance gain
15.3× compression with 15-point performance gain
21.59× compression with 19.15-point performance gain
Q3
3. Which of these is NOT one of the four key technical innovations mentioned in the paper?
Bidirectional reasoning layers for boundary awareness
Multi-modal context processing for image and text
Window-parallel inference architecture