2025-09-04 Papers


Paper 1

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Published: 2025-08-30

Link: http://arxiv.org/pdf/2509.00676

1. 📘 Topic and Domain: The paper explores vision-language modeling, specifically challenging the conventional separation between critic models (which evaluate outputs) and policy models (which generate responses).
2. 💡 Previous Research and New Ideas: Building on prior work that trains critic and policy models separately, this paper reorganizes preference-labeled critic datasets into verifiable training signals and uses reinforcement learning to produce a unified model capable of both evaluation and generation.
3. ❓ Problem: The paper aims to solve the traditional limitation of treating critic models solely as evaluators rather than generators, seeking to create a single model that excels at both tasks.
4. 🛠️ Methods: The authors use reinforcement learning on a base generative model (Qwen-2.5-VL-7B) with reorganized preference-labeled critic datasets, creating LLaVA-Critic-R1 and its enhanced version LLaVA-Critic-R1+.
5. 📊 Results and Evaluation: The model achieved significant improvements over its base model (+5.7% average gain across 26 benchmarks), reached state-of-the-art performance on MMMU (71.9) at 7B scale, and demonstrated +13.8% improvement on reasoning tasks through self-critique at test time.
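The paper's reward weighting (preference 0.9, format 0.1) lends itself to a small sketch. The Python below is illustrative, not the authors' code; the function and argument names are assumptions:

```python
import re

def critic_reward(output: str, predicted_pref: str, gold_pref: str,
                  w_pref: float = 0.9, w_fmt: float = 0.1) -> float:
    """Combined scalar reward for one critic rollout (sketch).

    predicted_pref / gold_pref: which of the two responses the model /
    the human annotators preferred, e.g. "A" or "B".
    """
    # Preference reward: 1 if the model's verdict matches the human label.
    r_pref = 1.0 if predicted_pref == gold_pref else 0.0
    # Format reward: output must contain a <think>...</think> block
    # and a final \boxed{...} answer.
    has_think = re.search(r"<think>.*?</think>", output, re.DOTALL) is not None
    has_boxed = re.search(r"\\boxed\{[^}]+\}", output) is not None
    r_fmt = 1.0 if (has_think and has_boxed) else 0.0
    return w_pref * r_pref + w_fmt * r_fmt

# Correct verdict in a well-formed output earns the full reward.
out = "<think>Response A is better grounded in the image.</think> \\boxed{A}"
```

Because the preference label is verifiable, this reward needs no learned reward model, which is what makes the reorganized critic data usable for RL.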

LLaVA-Critic-R1 methodology (flow chart summary):

- Data preparation: 40K pairwise critic examples; GPT rationales are stripped, keeping only image + question + two responses + preference labels, so reasoning must be self-derived.
- RL training (GRPO): preference reward (weight 0.9) plus format reward (weight 0.1); outputs follow <think>...</think> reasoning with a final \boxed{answer}.
- Models: base Qwen-2.5-VL-7B (also Mimo-VL / LLaMA-3.2) + critic training → LLaVA-Critic-R1 (+5.7% average improvement, strong dual capability); policy model ThinkLite-VL-7B and other reasoning VLMs + critic training → LLaVA-Critic-R1+ (71.9 MMMU, SOTA at 7B scale).
- Policy evaluation: 26 visual benchmarks covering perception & VQA, image reasoning, and chart understanding. Critic evaluation: visual reward benchmarks (VLRewardBench, MM-RLHF), showing superior judgment.
- Test-time scaling: self-critic best-of-128 via pairwise comparison and recursive selection (+13.8% on reasoning tasks).
- Key findings: (1) critic RL training surprisingly improves policy performance across diverse tasks; (2) a single model excels at both evaluation and generation; (3) the enhanced critic enables effective test-time scaling without additional training; (4) a path toward scalable, self-improving multimodal systems.
- Ablation studies: enhanced visual perception, structured reasoning (think-then-answer), policy-then-critic training order, and RL vs. SFT.
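The best-of-128 self-critique works by pairwise comparison with recursive selection. A minimal tournament-style sketch, where the `judge` callable stands in for the model critiquing its own outputs (names and the toy judge are illustrative):

```python
def self_critic_best_of_n(candidates, judge):
    """Tournament selection: repeatedly compare two candidates and keep
    the winner, so N candidates need N - 1 pairwise judgments (sketch).

    judge(a, b) -> returns whichever of a or b is preferred; in the
    paper's setting this would be the model judging its own responses.
    """
    pool = list(candidates)
    while len(pool) > 1:
        a, b = pool.pop(), pool.pop()
        pool.append(judge(a, b))
    return pool[0]

# Toy judge that prefers the longer (more detailed) response.
best = self_critic_best_of_n(
    ["short", "a longer answer", "mid"],
    judge=lambda a, b: a if len(a) >= len(b) else b,
)
```

The appeal of this scheme is that no extra training is needed at test time; a stronger critic directly translates into better selection among sampled responses.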
Q1. What is the main innovation of LLaVA-Critic-R1 compared to traditional vision-language models?
- It uses a larger model architecture than previous approaches
- It combines critic and policy capabilities in a single model through reinforcement learning
- It only focuses on improving evaluation capabilities

Q2. When applying test-time self-critique, what performance improvement did the model achieve on reasoning tasks?
- +5.7% improvement
- +10.2% improvement
- +13.8% improvement

Q3. Why did the authors discard GPT-generated rationales in their RL training approach?
- To reduce computational costs during training
- To avoid knowledge distillation bias and encourage self-derived reasoning
- Because the rationales were of poor quality

Paper 2

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Published: 2025-09-01

Link: http://arxiv.org/pdf/2509.01215

1. 📘 Topic and Domain: Development of a distillation-free framework for adapting vision-language models to document conversion tasks in computer vision and natural language processing.
2. 💡 Previous Research and New Ideas: Building on prior work in document conversion and vision-language models, this paper proposes a novel two-stage framework that eliminates reliance on knowledge distillation from larger models.
3. ❓ Problem: Addresses the challenge of creating high-quality labeled datasets for training document conversion models without depending on distillation from existing models, which often introduces biases and limitations.
4. 🛠️ Methods: Implements a two-stage approach: (1) Uniform Format Warm-up Stage generates synthetic data with standardized formats, and (2) Iterative Self-improvement Stage uses filtering strategies to refine real-world document annotations.
5. 📊 Results and Evaluation: The resulting POINTS-Reader model outperforms many existing public and proprietary models, including larger ones, achieving state-of-the-art performance across various benchmarks, particularly excelling in table recognition tasks.
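The plain-text filter scores a self-annotation by F1 against an OCR reference. A minimal word-level sketch, assuming a simple bag-of-words overlap and an illustrative threshold (the paper's exact scoring function and threshold are assumptions here):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a model transcription and an OCR reference,
    the kind of score a plain-text filter could threshold on (sketch)."""
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())  # shared words, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def keep_sample(pred_text: str, ocr_text: str, threshold: float = 0.9) -> bool:
    # Keep a self-annotated page only if it agrees closely with OCR output.
    return token_f1(pred_text, ocr_text) >= threshold
```

Filtering of this kind is what lets the self-improvement loop retrain on its own annotations without the label noise compounding across iterations.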

POINTS-Reader framework (flow chart summary):

- Stage 1, Uniform Format Warm-up: a large language model generates training text in a unified output format (Markdown for plain text, HTML for tables), rendered into document images via HTML templates; the vision-language model is then trained on four generated data categories: (1) plain text only, (2) text with mathematical formulas, (3) text with tables, (4) multi-column layouts with tables.
- Stage 2, Iterative Self-improvement: the model annotates real-world documents; rule-based filters keep only reliable annotations (plain text: F1-score against an OCR reference; tables: structure validation; formulas: syntax correctness); the model is retrained on the filtered data, and the loop repeats for multiple iterations until convergence, progressively improving both data quality and model performance.
- Result: the POINTS-Reader model delivers high-quality document conversion without distillation, supporting plain text, tables, and mathematical formulas.
- Key innovations: distillation-free approach, automated data generation, self-improvement mechanism, unified output format.
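The table filter checks structural validity rather than content accuracy. A minimal sketch of such a check using Python's standard `html.parser`, verifying only that table-related tags nest and close properly (the paper's actual validator is not shown, so its shape is an assumption):

```python
from html.parser import HTMLParser

class TableChecker(HTMLParser):
    """Flags a predicted HTML table as invalid if its structural tags
    do not nest and close properly; cell contents are ignored (sketch)."""
    TAGS = {"table", "thead", "tbody", "tr", "td", "th"}

    def __init__(self):
        super().__init__()
        self.stack, self.ok = [], True

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.TAGS:
            if not self.stack or self.stack.pop() != tag:
                self.ok = False  # closing tag without a matching opener

def table_is_valid(html: str) -> bool:
    checker = TableChecker()
    checker.feed(html)
    return checker.ok and not checker.stack  # all opened tags were closed
```

Notably, the quiz below points out that content accuracy still improved over iterations even though the filter only validates structure like this.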
Q1. What is the main innovation of POINTS-Reader compared to existing document conversion approaches?
- It uses a larger model architecture than previous approaches
- It eliminates the need for knowledge distillation from teacher models
- It only works with simple document layouts

Q2. In the Uniform Format Warm-up Stage, what output format is used for representing tables and why?
- LaTeX format because it's the most widely used
- Markdown format because it's simple to implement
- HTML format because it can handle complex structures like merged cells

Q3. During the Iterative Self-improvement Stage, what interesting observation was made about table and formula recognition?
- The model's performance decreased over iterations
- Performance improved even though only structural validity was checked, not content accuracy
- The model could only improve on simple layouts

Paper 3

DCPO: Dynamic Clipping Policy Optimization

Published: 2025-09-02

Link: http://arxiv.org/pdf/2509.02333

1. 📘 Topic and Domain: Dynamic Clipping Policy Optimization (DCPO) for enhancing reasoning capabilities in large language models through reinforcement learning.
2. 💡 Previous Research and New Ideas: Based on GRPO and DAPO algorithms for reinforcement learning from verifiable rewards; proposes new dynamic clipping bounds and smooth advantage standardization.
3. ❓ Problem: Addresses zero gradients and inefficient training in existing approaches, which stem from fixed clipping bounds that suppress updates on low-probability tokens and from standardizing groups of identical rewards into zero advantages.
4. 🛠️ Methods: Introduces dynamic clipping strategy that adjusts bounds based on token-specific probabilities, and smooth advantage standardization that standardizes rewards across cumulative training steps.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance on four benchmarks across four models, with 46.7% Avg@1 accuracy on AIME24, 28% improvement in nonzero advantage ratio, doubled training efficiency, and significantly reduced token clipping ratio.
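The DAC bounds quoted in the flow chart below follow from solving |(r(x) − 1)·p(x)| ≤ ε with ratio r(x) = p(x)/q(x), which makes the clipping range depend on the old-policy probability q(x). A small sketch (the ε values are placeholders, not the paper's settings):

```python
import math

def dac_bounds(q: float, eps_low: float = 0.2, eps_high: float = 0.28):
    """Dynamic Adaptive Clipping bounds on the ratio r(x) = p(x)/q(x).

    Solving |(r - 1) * p| = |(r - 1) * r * q| <= eps for r yields
    probability-dependent bounds: rare tokens (small q) get a wider
    clipping range than the fixed bounds of GRPO/DAPO would allow.
    """
    lower = 0.5 + 0.5 * math.sqrt(max(1 - 4 * eps_low / q, 0.0))
    upper = 0.5 + 0.5 * math.sqrt(1 + 4 * eps_high / q)
    return lower, upper

# A rare token (q = 0.01) gets far more room than a common one (q = 0.9).
rare_lo, rare_hi = dac_bounds(0.01)
common_lo, common_hi = dac_bounds(0.9)
```

This is the mechanism behind the "better low-probability exploration" claim: for small q(x) the upper bound grows like √(ε/q), instead of clipping the update at a fixed 1 + ε.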

DCPO workflow (flow chart summary):

- Motivation: GRPO/DAPO suffer from zero gradients and fixed clipping bounds.
- Dynamic Adaptive Clipping (DAC): adapts clipping bounds per token from the constraint |(r(x) − 1)·p(x)| ≤ ε, giving low-probability tokens more room for exploration. Bounds: 0.5 + ½√(max(1 − 4ε_low/q(x), 0)) ≤ r(x) ≤ 0.5 + ½√(1 + 4ε_high/q(x)).
- Smooth Advantage Standardization (SAS): a weighted combination of step-wise and cumulative standardization, Â^i_j = min(|Â^i_new,j|, |Â^i_total,j|).
- Only Token Mean loss (OTM): response-level averaging that preserves the relative advantage structure among responses.
- Training process: generate responses → apply DAC + SAS + OTM → update policy.
- Token-level benefits: ~10× lower token clipping ratio (TCR), better rare-token exploration, stable clipping ratios, enhanced diversity. Response-level benefits: higher RUR (+28%, the fraction of responses with nonzero advantage), better data utilization, stable training. Efficiency: 2× faster training than DAPO, less data waste, consistent performance.
- Experimental results: AIME24 38.8 (DCPO) vs. 32.1 (GRPO) vs. 31.6 (DAPO); consistent improvements across 4 models and 4 benchmarks.
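The SAS rule above combines step-wise and cumulative standardization by keeping the smaller-magnitude estimate. A simplified sketch; the fallback for groups of identical rewards is my reading of the mechanism, not the paper's exact formula:

```python
import statistics

def standardize(x: float, mean: float, std: float) -> float:
    return 0.0 if std == 0 else (x - mean) / std

def sas_advantages(step_rewards, all_rewards):
    """Smooth Advantage Standardization (sketch, simplified).

    Each reward is standardized twice: against the current step's group
    statistics and against cumulative statistics over all rewards seen
    so far; the smaller-magnitude value is kept. When a group's rewards
    are identical (step std = 0), step-wise standardization would zero
    out every advantage, so we fall back to the cumulative estimate and
    avoid the zero-gradient case.
    """
    mu_s, sd_s = statistics.fmean(step_rewards), statistics.pstdev(step_rewards)
    mu_c, sd_c = statistics.fmean(all_rewards), statistics.pstdev(all_rewards)
    advs = []
    for r in step_rewards:
        a_tot = standardize(r, mu_c, sd_c)
        if sd_s == 0:
            advs.append(a_tot)  # degenerate group: use cumulative stats
        else:
            a_new = standardize(r, mu_s, sd_s)
            advs.append(a_new if abs(a_new) <= abs(a_tot) else a_tot)
    return advs
```

The point of the cumulative term is data utilization: a batch whose rewards all tie still produces a learning signal instead of being wasted, which is where the higher RUR comes from.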
Q1. What is the main innovation of DCPO compared to previous approaches?
- Using fixed clipping bounds for all tokens
- Dynamically adjusting clipping bounds based on token probabilities
- Removing clipping bounds entirely

Q2. On the AIME24 benchmark with 32-time sampling, what performance improvement did DCPO-7B achieve over GRPO?
- An increase from 32.1 to 38.8
- An increase from 31.6 to 36.7
- An increase from 36.7 to 46.7

Q3. What problem with previous approaches did DCPO's smooth advantage standardization technique address?
- Too many model parameters
- Slow training speed
- Zero gradients from identical rewards