2025-09-04 Papers


Paper 1

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Published: 2025-08-30

Link: http://arxiv.org/pdf/2509.00676

1. 📘 Topic and Domain: The paper explores vision-language modeling, specifically challenging the conventional separation between critic models (which evaluate outputs) and policy models (which generate responses).
2. 💡 Previous Research and New Ideas: Building on prior work that trains critic and policy models separately, this paper reorganizes preference-labeled critic datasets into verifiable training signals and uses reinforcement learning to produce a unified model capable of both evaluation and generation.
3. ❓ Problem: The paper aims to solve the traditional limitation of treating critic models solely as evaluators rather than generators, seeking to create a single model that excels at both tasks.
4. 🛠️ Methods: The authors use reinforcement learning on a base generative model (Qwen-2.5-VL-7B) with reorganized preference-labeled critic datasets, creating LLaVA-Critic-R1 and its enhanced version LLaVA-Critic-R1+.
5. 📊 Results and Evaluation: The model achieved significant improvements over its base model (+5.7% average gain across 26 benchmarks), reached state-of-the-art performance on MMMU (71.9) at 7B scale, and demonstrated +13.8% improvement on reasoning tasks through self-critique at test time.
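The paper's reward weighting (preference 0.9, format 0.1) lends itself to a small sketch. The Python below is illustrative, not the authors' code; the function and argument names are assumptions:

```python
import re

def critic_reward(output: str, predicted_pref: str, gold_pref: str,
                  w_pref: float = 0.9, w_fmt: float = 0.1) -> float:
    """Combined scalar reward for one critic rollout (sketch).

    predicted_pref / gold_pref: which of the two responses the model /
    the human annotators preferred, e.g. "A" or "B".
    """
    # Preference reward: 1 if the model's verdict matches the human label.
    r_pref = 1.0 if predicted_pref == gold_pref else 0.0
    # Format reward: output must contain a <think>...</think> block
    # and a final \boxed{...} answer.
    has_think = re.search(r"<think>.*?</think>", output, re.DOTALL) is not None
    has_boxed = re.search(r"\\boxed\{[^}]+\}", output) is not None
    r_fmt = 1.0 if (has_think and has_boxed) else 0.0
    return w_pref * r_pref + w_fmt * r_fmt

# Correct verdict in a well-formed output earns the full reward.
out = "<think>Response A is better grounded in the image.</think> \\boxed{A}"
```

Because the preference label is verifiable, this reward needs no learned reward model, which is what makes the reorganized critic data usable for RL.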

LLaVA-Critic-R1 methodology (flow chart summary):

- Data preparation: 40K pairwise critic examples; GPT rationales are stripped, keeping only image + question + two responses + preference labels, so reasoning must be self-derived.
- RL training (GRPO): preference reward (weight 0.9) plus format reward (weight 0.1); outputs follow <think>...</think> reasoning with a final \boxed{answer}.
- Models: base Qwen-2.5-VL-7B (also Mimo-VL / LLaMA-3.2) + critic training → LLaVA-Critic-R1 (+5.7% average improvement, strong dual capability); policy model ThinkLite-VL-7B and other reasoning VLMs + critic training → LLaVA-Critic-R1+ (71.9 MMMU, SOTA at 7B scale).
- Policy evaluation: 26 visual benchmarks covering perception & VQA, image reasoning, and chart understanding. Critic evaluation: visual reward benchmarks (VLRewardBench, MM-RLHF), showing superior judgment.
- Test-time scaling: self-critic best-of-128 via pairwise comparison and recursive selection (+13.8% on reasoning tasks).
- Key findings: (1) critic RL training surprisingly improves policy performance across diverse tasks; (2) a single model excels at both evaluation and generation; (3) the enhanced critic enables effective test-time scaling without additional training; (4) a path toward scalable, self-improving multimodal systems.
- Ablation studies: enhanced visual perception, structured reasoning (think-then-answer), policy-then-critic training order, and RL vs. SFT.
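The best-of-128 self-critique works by pairwise comparison with recursive selection. A minimal tournament-style sketch, where the `judge` callable stands in for the model critiquing its own outputs (names and the toy judge are illustrative):

```python
def self_critic_best_of_n(candidates, judge):
    """Tournament selection: repeatedly compare two candidates and keep
    the winner, so N candidates need N - 1 pairwise judgments (sketch).

    judge(a, b) -> returns whichever of a or b is preferred; in the
    paper's setting this would be the model judging its own responses.
    """
    pool = list(candidates)
    while len(pool) > 1:
        a, b = pool.pop(), pool.pop()
        pool.append(judge(a, b))
    return pool[0]

# Toy judge that prefers the longer (more detailed) response.
best = self_critic_best_of_n(
    ["short", "a longer answer", "mid"],
    judge=lambda a, b: a if len(a) >= len(b) else b,
)
```

The appeal of this scheme is that no extra training is needed at test time; a stronger critic directly translates into better selection among sampled responses.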
Q1. What is the main innovation of LLaVA-Critic-R1 compared to traditional vision-language models?
- It uses a larger model architecture than previous approaches
- It combines critic and policy capabilities in a single model through reinforcement learning
- It only focuses on improving evaluation capabilities

Q2. When applying test-time self-critique, what performance improvement did the model achieve on reasoning tasks?
- +5.7% improvement
- +10.2% improvement
- +13.8% improvement

Q3. Why did the authors discard GPT-generated rationales in their RL training approach?
- To reduce computational costs during training
- To avoid knowledge distillation bias and encourage self-derived reasoning
- Because the rationales were of poor quality

Paper 2

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Published: 2025-09-01

Link: http://arxiv.org/pdf/2509.01215

1. 📘 Topic and Domain: Development of a distillation-free framework for adapting vision-language models to document conversion tasks in computer vision and natural language processing.
2. 💡 Previous Research and New Ideas: Building on prior work in document conversion and vision-language models, this paper proposes a novel two-stage framework that eliminates reliance on knowledge distillation from larger models.
3. ❓ Problem: Addresses the challenge of creating high-quality labeled datasets for training document conversion models without depending on distillation from existing models, which often introduces biases and limitations.
4. 🛠️ Methods: Implements a two-stage approach: (1) Uniform Format Warm-up Stage generates synthetic data with standardized formats, and (2) Iterative Self-improvement Stage uses filtering strategies to refine real-world document annotations.
5. 📊 Results and Evaluation: The resulting POINTS-Reader model outperforms many existing public and proprietary models, including larger ones, achieving state-of-the-art performance across various benchmarks, particularly excelling in table recognition tasks.
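The plain-text filter scores a self-annotation by F1 against an OCR reference. A minimal word-level sketch, assuming a simple bag-of-words overlap and an illustrative threshold (the paper's exact scoring function and threshold are assumptions here):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a model transcription and an OCR reference,
    the kind of score a plain-text filter could threshold on (sketch)."""
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())  # shared words, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def keep_sample(pred_text: str, ocr_text: str, threshold: float = 0.9) -> bool:
    # Keep a self-annotated page only if it agrees closely with OCR output.
    return token_f1(pred_text, ocr_text) >= threshold
```

Filtering of this kind is what lets the self-improvement loop retrain on its own annotations without the label noise compounding across iterations.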

POINTS-Reader framework (flow chart summary):

- Stage 1, Uniform Format Warm-up: a large language model generates training text in a unified output format (Markdown for plain text, HTML for tables), rendered into document images via HTML templates; the vision-language model is then trained on four generated data categories: (1) plain text only, (2) text with mathematical formulas, (3) text with tables, (4) multi-column layouts with tables.
- Stage 2, Iterative Self-improvement: the model annotates real-world documents; rule-based filters keep only reliable annotations (plain text: F1-score against an OCR reference; tables: structure validation; formulas: syntax correctness); the model is retrained on the filtered data, and the loop repeats for multiple iterations until convergence, progressively improving both data quality and model performance.
- Result: the POINTS-Reader model delivers high-quality document conversion without distillation, supporting plain text, tables, and mathematical formulas.
- Key innovations: distillation-free approach, automated data generation, self-improvement mechanism, unified output format.
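The table filter checks structural validity rather than content accuracy. A minimal sketch of such a check using Python's standard `html.parser`, verifying only that table-related tags nest and close properly (the paper's actual validator is not shown, so its shape is an assumption):

```python
from html.parser import HTMLParser

class TableChecker(HTMLParser):
    """Flags a predicted HTML table as invalid if its structural tags
    do not nest and close properly; cell contents are ignored (sketch)."""
    TAGS = {"table", "thead", "tbody", "tr", "td", "th"}

    def __init__(self):
        super().__init__()
        self.stack, self.ok = [], True

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.TAGS:
            if not self.stack or self.stack.pop() != tag:
                self.ok = False  # closing tag without a matching opener

def table_is_valid(html: str) -> bool:
    checker = TableChecker()
    checker.feed(html)
    return checker.ok and not checker.stack  # all opened tags were closed
```

Notably, the quiz below points out that content accuracy still improved over iterations even though the filter only validates structure like this.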
Q1. What is the main innovation of POINTS-Reader compared to existing document conversion approaches?
- It uses a larger model architecture than previous approaches
- It eliminates the need for knowledge distillation from teacher models
- It only works with simple document layouts

Q2. In the Uniform Format Warm-up Stage, what output format is used for representing tables and why?
- LaTeX format because it's the most widely used
- Markdown format because it's simple to implement
- HTML format because it can handle complex structures like merged cells

Q3. During the Iterative Self-improvement Stage, what interesting observation was made about table and formula recognition?
- The model's performance decreased over iterations
- Performance improved even though only structural validity was checked, not content accuracy
- The model could only improve on simple layouts

Paper 3

DCPO: Dynamic Clipping Policy Optimization

Published: 2025-09-02

Link: http://arxiv.org/pdf/2509.02333

1. 📘 Topic and Domain: Dynamic Clipping Policy Optimization (DCPO) for enhancing reasoning capabilities in large language models through reinforcement learning.
2. 💡 Previous Research and New Ideas: Based on GRPO and DAPO algorithms for reinforcement learning from verifiable rewards; proposes new dynamic clipping bounds and smooth advantage standardization.
3. ❓ Problem: Addresses zero gradients and inefficient training in existing approaches, which stem from fixed clipping bounds that suppress updates on low-probability tokens and from standardizing groups of identical rewards into zero advantages.
4. 🛠️ Methods: Introduces dynamic clipping strategy that adjusts bounds based on token-specific probabilities, and smooth advantage standardization that standardizes rewards across cumulative training steps.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance on four benchmarks across four models, with 46.7% Avg@1 accuracy on AIME24, 28% improvement in nonzero advantage ratio, doubled training efficiency, and significantly reduced token clipping ratio.
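The DAC bounds quoted in the flow chart below follow from solving |(r(x) − 1)·p(x)| ≤ ε with ratio r(x) = p(x)/q(x), which makes the clipping range depend on the old-policy probability q(x). A small sketch (the ε values are placeholders, not the paper's settings):

```python
import math

def dac_bounds(q: float, eps_low: float = 0.2, eps_high: float = 0.28):
    """Dynamic Adaptive Clipping bounds on the ratio r(x) = p(x)/q(x).

    Solving |(r - 1) * p| = |(r - 1) * r * q| <= eps for r yields
    probability-dependent bounds: rare tokens (small q) get a wider
    clipping range than the fixed bounds of GRPO/DAPO would allow.
    """
    lower = 0.5 + 0.5 * math.sqrt(max(1 - 4 * eps_low / q, 0.0))
    upper = 0.5 + 0.5 * math.sqrt(1 + 4 * eps_high / q)
    return lower, upper

# A rare token (q = 0.01) gets far more room than a common one (q = 0.9).
rare_lo, rare_hi = dac_bounds(0.01)
common_lo, common_hi = dac_bounds(0.9)
```

This is the mechanism behind the "better low-probability exploration" claim: for small q(x) the upper bound grows like √(ε/q), instead of clipping the update at a fixed 1 + ε.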

DCPO workflow (flow chart summary):

- Motivation: GRPO/DAPO suffer from zero gradients and fixed clipping bounds.
- Dynamic Adaptive Clipping (DAC): adapts clipping bounds per token from the constraint |(r(x) − 1)·p(x)| ≤ ε, giving low-probability tokens more room for exploration. Bounds: 0.5 + ½√(max(1 − 4ε_low/q(x), 0)) ≤ r(x) ≤ 0.5 + ½√(1 + 4ε_high/q(x)).
- Smooth Advantage Standardization (SAS): a weighted combination of step-wise and cumulative standardization, Â^i_j = min(|Â^i_new,j|, |Â^i_total,j|).
- Only Token Mean loss (OTM): response-level averaging that preserves the relative advantage structure among responses.
- Training process: generate responses → apply DAC + SAS + OTM → update policy.
- Token-level benefits: ~10× lower token clipping ratio (TCR), better rare-token exploration, stable clipping ratios, enhanced diversity. Response-level benefits: higher RUR (+28%, the fraction of responses with nonzero advantage), better data utilization, stable training. Efficiency: 2× faster training than DAPO, less data waste, consistent performance.
- Experimental results: AIME24 38.8 (DCPO) vs. 32.1 (GRPO) vs. 31.6 (DAPO); consistent improvements across 4 models and 4 benchmarks.
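The SAS rule above combines step-wise and cumulative standardization by keeping the smaller-magnitude estimate. A simplified sketch; the fallback for groups of identical rewards is my reading of the mechanism, not the paper's exact formula:

```python
import statistics

def standardize(x: float, mean: float, std: float) -> float:
    return 0.0 if std == 0 else (x - mean) / std

def sas_advantages(step_rewards, all_rewards):
    """Smooth Advantage Standardization (sketch, simplified).

    Each reward is standardized twice: against the current step's group
    statistics and against cumulative statistics over all rewards seen
    so far; the smaller-magnitude value is kept. When a group's rewards
    are identical (step std = 0), step-wise standardization would zero
    out every advantage, so we fall back to the cumulative estimate and
    avoid the zero-gradient case.
    """
    mu_s, sd_s = statistics.fmean(step_rewards), statistics.pstdev(step_rewards)
    mu_c, sd_c = statistics.fmean(all_rewards), statistics.pstdev(all_rewards)
    advs = []
    for r in step_rewards:
        a_tot = standardize(r, mu_c, sd_c)
        if sd_s == 0:
            advs.append(a_tot)  # degenerate group: use cumulative stats
        else:
            a_new = standardize(r, mu_s, sd_s)
            advs.append(a_new if abs(a_new) <= abs(a_tot) else a_tot)
    return advs
```

The point of the cumulative term is data utilization: a batch whose rewards all tie still produces a learning signal instead of being wasted, which is where the higher RUR comes from.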
Q1. What is the main innovation of DCPO compared to previous approaches?
- Using fixed clipping bounds for all tokens
- Dynamically adjusting clipping bounds based on token probabilities
- Removing clipping bounds entirely

Q2. On the AIME24 benchmark with 32-time sampling, what performance improvement did DCPO-7B achieve over GRPO?
- An increase from 32.1 to 38.8
- An increase from 31.6 to 36.7
- An increase from 36.7 to 46.7

Q3. What problem with previous approaches did DCPO's smooth advantage standardization technique address?
- Too many model parameters
- Slow training speed
- Zero gradients from identical rewards