2025-05-02 Papers

Paper 1

DeepCritic: Deliberate Critique with Large Language Models

Published: 2025-05-01

Link: http://arxiv.org/pdf/2505.00662

1. 📘 Topic and Domain: The paper focuses on enhancing the mathematical critique capabilities of Large Language Models (LLMs), specifically in their ability to evaluate and provide feedback on mathematical reasoning solutions.
2. 💡 Previous Research and New Ideas: Building on existing LLM critics, which typically produce shallow critiques, the paper proposes a two-stage framework called DeepCritic that trains LLMs to generate more deliberate and thorough critiques of mathematical solutions.
3. ❓ Problem: The paper addresses the limitation of current LLM critics that provide superficial critiques of mathematical solutions, leading to low judgment accuracy and insufficient feedback for error correction.
4. 🛠️ Methods: The paper employs a two-stage approach: first, Qwen2.5-72B-Instruct generates 4.5K long-form critiques used for supervised fine-tuning; then reinforcement learning is applied on either human-labeled data or data annotated automatically via Monte Carlo sampling.
5. 📊 Results and Evaluation: The developed DeepCritic model outperformed existing LLM critics (including GPT-4o) on various error identification benchmarks and demonstrated effectiveness in helping LLM generators refine erroneous solutions through detailed feedback.

[Figure: DeepCritic two-stage training pipeline]
Stage 1 (Critique Teaching, SFT):
1. Initial critique generation: Qwen2.5-72B-Instruct critiques each step si independently.
2. In-depth critique generation: Qwen2.5-72B-Instruct critiques the initial critique or re-evaluates step si, filtered against the ground truth.
3. Final critique synthesis: Qwen2.5-72B-Instruct with in-context learning merges the initial and in-depth critiques, yielding 4.5K long-form, deliberate seed critiques.
4. Supervised fine-tuning (SFT) on the seed data produces DeepCritic-7B-SFT.
Stage 2 (Critique Incentivization, RL):
- RL data source: either human-annotated data (e.g., PRM800K; 40.7K samples) or auto-labeled data, obtained by generating solutions on NuminaMath-CoT and estimating their correctness via Monte Carlo sampling with Qwen2.5-7B (14.2K samples).
- Reinforcement learning (e.g., GRPO) with an accuracy reward yields the final models with enhanced math critique ability: DeepCritic-7B-RL-PRM800K and DeepCritic-7B-RL-Numina.
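The Monte Carlo correctness estimation used for auto-labeling in Stage 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sampler, answer format, and decision threshold below are all assumptions.

```python
import random

def estimate_correctness(sample_completion, prefix_steps, gold_answer, n_samples=8):
    """Monte Carlo estimate of a solution prefix's correctness:
    sample several completions from the prefix and measure how often
    the final answer matches the gold answer."""
    hits = 0
    for _ in range(n_samples):
        if sample_completion(prefix_steps) == gold_answer:
            hits += 1
    return hits / n_samples

# Toy stand-in for an LLM sampler (hypothetical): a good prefix
# completes to the gold answer about 80% of the time.
def toy_sampler(prefix_steps):
    return "42" if random.random() < 0.8 else "7"

random.seed(0)
score = estimate_correctness(toy_sampler, ["step 1", "step 2"], "42")
# A prefix is labeled correct if the estimated probability clears a threshold.
label = "correct" if score > 0.5 else "incorrect"
```

In the paper's setting, the sampler would be Qwen2.5-7B continuing a partial solution; here a coin flip stands in so the sketch runs without a model.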
Q1
1. What is the main problem with existing LLM critics in the math domain that the DeepCritic framework aims to solve?
They are too computationally expensive to run.
They provide critiques that are superficial and lack in-depth analysis on each step.
They can only critique correct solutions, not incorrect ones.
Q2
2. The DeepCritic framework employs a two-stage training pipeline. What are these two stages in order?
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL).
Reinforcement Learning (RL) followed by Supervised Fine-Tuning (SFT).
Generating initial critiques followed by generating final answers.
Q3
3. How did the DeepCritic model perform compared to existing LLM critics (including GPT-4o and DeepSeek-R1-Distill models) on various error identification benchmarks?
It performed significantly worse across all benchmarks.
It performed comparably, showing similar accuracy.
It significantly outperformed them on various benchmarks.
Paper 2

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Published: 2025-05-01

Link: http://arxiv.org/pdf/2505.00703

1. 📘 Topic and Domain: The paper focuses on enhancing text-to-image generation through reasoning capabilities using chain-of-thought (CoT) approaches in computer vision and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on prior work in language-model reasoning and visual generation, the paper introduces a bi-level CoT approach that combines semantic-level planning with token-level generation, a combination not previously applied to image generation.
3. ❓ Problem: The paper addresses the challenge of incorporating reasoning capabilities into text-to-image generation models to improve their understanding of complex prompts and generation quality.
4. 🛠️ Methods: The authors develop BiCoT-GRPO, a reinforcement learning framework that jointly optimizes both semantic-level and token-level CoT, using an ensemble of vision experts as reward models.
5. 📊 Results and Evaluation: The resulting model, T2I-R1, achieved a 13% improvement on T2I-CompBench and a 19% improvement on the WISE benchmark, surpassing the state-of-the-art model FLUX.1.

[Figure: T2I-R1 method, reinforcing generation via bi-level CoT and GRPO]
Base model: a ULM (e.g., Janus-Pro). Input: image prompt p plus a reasoning instruction.
Bi-level CoT generation (using the ULM πθold):
1. Semantic-level CoT (s): the ULM generates textual reasoning/planning ("How does the whole image look?"), s = {s1, s2, ..., s|s|}.
2. Token-level CoT (t): conditioned on the prompt p and the semantic CoT s, the ULM generates image tokens patch by patch ("How does the next patch look?"), t = {t1, t2, ..., tM}; an image decoder D renders the tokens into an image.
For each prompt, G responses {oi = (si, ti)} are generated and decoded into images {Ii} for group computation.
Rewards come from an ensemble of vision experts: an HPM (aesthetics, alignment), an object detector (existence, relations), a VQA model (attributes), and an ORM (prompt alignment); the averaged rewards are R1, R2, ..., RG.
BiCoT-GRPO optimization step:
1. Compute group-relative advantages Ai by normalizing rewards within the group G.
2. Calculate the policy ratio ri,j(θ) = πθ(oi,j | ...) / πθold(oi,j | ...) for oi = (si, ti).
3. Update the ULM πθ via the GRPO objective (Eq. 2): maximize the clipped advantage term minus β · DKL(πθ || πref).
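The group-relative advantage computation at the core of GRPO-style training can be sketched with a small numeric example. This is a generic GRPO sketch under stated assumptions, not T2I-R1's code; the reward values are invented for illustration.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward by the group's
    mean and standard deviation, so responses compete only against
    other responses sampled for the same prompt."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    if std == 0:
        return [0.0] * g  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

# Averaged ensemble rewards for G = 4 sampled (semantic CoT, image) pairs
# (illustrative values only).
rewards = [0.9, 0.6, 0.3, 0.6]
advs = group_relative_advantages(rewards)
```

Each advantage then scales a clipped policy-ratio term in the objective, with a β-weighted KL penalty toward the reference model keeping updates conservative.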
Q1
1. According to the paper, what are the two distinct levels of Chain-of-Thought (CoT) reasoning identified for enhancing text-to-image generation?
Global-level CoT and Local-level CoT
Semantic-level CoT and Token-level CoT
Textual CoT and Visual CoT
Q2
2. What is the name of the reinforcement learning framework introduced in the paper to jointly optimize both levels of CoT?
DualCoT-PPO
BiCoT-GRPO
Ensemble-RL
Q3
3. Which benchmarks did T2I-R1 achieve significant performance improvements on compared to baseline and state-of-the-art models?
MS COCO and Visual Genome
T2I-CompBench and WISE
CLEVR and VQA
Paper 3

COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Published: 2025-04-30

Link: http://arxiv.org/pdf/2504.21850

1. 📘 Topic and Domain: The paper focuses on improving multimodal large language models' ability to handle complex visual-language tasks through a novel compositional training approach.
2. 💡 Previous Research and New Ideas: The paper builds on previous visual instruction tuning research but proposes a new approach called COMPACT that explicitly controls for compositional complexity in training data rather than just scaling data volume.
3. ❓ Problem: The paper addresses how current multimodal models struggle with complex tasks requiring multiple capabilities simultaneously (like recognizing objects, counting them, and understanding spatial relationships together).
4. 🛠️ Methods: The authors develop a data generation pipeline that creates training examples combining 10 atomic visual capabilities into progressively more complex tasks (k=1,2,3 capabilities), using Gemini for generation and verification.
5. 📊 Results and Evaluation: Using only 10% of the standard training data, COMPACT matched or exceeded full-scale visual instruction tuning, with particularly strong gains on complex tasks requiring four or more capabilities (83.3% improvement on MMStar and 94.0% on MM-Vet).

[Figure: COMPACT workflow for compositional visual capability tuning]
Data generation pipeline, starting from 10 defined atomic visual capabilities (color, shape, object recognition, action recognition, text recognition, spatial recognition, counting, spatial relation, object interaction, scene understanding):
Step 1 (capability sampling): for an image from LLaVA-665K, sample k ∈ {1, 2, 3} atomic capabilities, ensuring diversity.
Step 2 (conversation generation): prompt Gemini-2.0-Flash with the image and the sampled capabilities to generate a QA pair integrating exactly k capabilities, plus a confidence score.
Step 3 (quality verification): filter and verify the generated QA for quality, grounding, and use of exactly k capabilities; reject and retry otherwise.
Step 4 (dataset assembly): combine the generated compositional data (e.g., 32K examples, balanced across k = 1, 2, 3) with a small subset of LLaVA-665K VIT data (e.g., 5%, for instruction following) into the final COMPACT training dataset (e.g., 65K).
Model training and evaluation: fine-tune a pre-VIT MLLM checkpoint (e.g., LLaVA-v1.5-7B) and evaluate on benchmarks such as MM-Vet and MMStar.
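The capability-sampling step can be sketched as follows. The capability list matches the paper's 10 atomic capabilities, but the round-robin balancing scheme is a simplification assumed here for illustration, not the paper's exact sampling procedure.

```python
import random

ATOMIC_CAPABILITIES = [
    "color", "shape", "object recognition", "action recognition",
    "text recognition", "spatial recognition", "counting",
    "spatial relation", "object interaction", "scene understanding",
]

def sample_capabilities(rng, k):
    """Sample k distinct atomic capabilities for one training example."""
    assert k in (1, 2, 3)
    return rng.sample(ATOMIC_CAPABILITIES, k)

def build_balanced_plan(n_examples, rng=None):
    """Assign each example a compositional complexity k, balanced across
    k = 1, 2, 3, then sample its capability combination."""
    rng = rng or random.Random(0)
    plan = []
    for i in range(n_examples):
        k = (i % 3) + 1  # round-robin keeps the k-distribution balanced
        plan.append(sample_capabilities(rng, k))
    return plan

plan = build_balanced_plan(6)
# Each entry is a list of 1-3 distinct capabilities; k cycles 1, 2, 3, ...
```

Each sampled combination would then seed a Gemini prompt asking for a QA pair that exercises exactly those capabilities, with verification rejecting pairs that use more or fewer.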
Q1
1. According to the paper, what is a primary limitation of current MLLMs that COMPACT aims to address?
Their inability to process high-resolution images efficiently.
Their struggle with complex visual tasks that require combining multiple capabilities.
Their lack of diverse visual instruction tuning datasets for simple tasks.
Q2
2. What is the key distinguishing feature of COMPACT's training data generation compared to traditional Visual Instruction Tuning (VIT)?
It focuses primarily on generating a much larger volume of data.
It explicitly controls and balances the compositional complexity (number of combined capabilities) of training examples.
It relies exclusively on human annotation for data quality verification.
Q3
3. COMPACT demonstrates improved performance, particularly on complex multi-capability tasks, while using what fraction of the LLaVA-665K VIT data budget?
More than 50%
Approximately 25%
Less than 10%