2025-05-23 Papers


Paper 1

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Published: 2025-05-22

Link: http://arxiv.org/pdf/2505.17022

1. 📘 Topic and Domain: The paper focuses on enhancing visual generation models' reasoning capabilities through reinforcement learning, specifically in the domain of text-to-image generation and computer vision.
2. 💡 Previous Research and New Ideas: The work builds upon the Generation Chain-of-Thought (GoT) approach but introduces a novel reinforcement learning framework to discover reasoning strategies autonomously, rather than relying on predefined templates.
3. ❓ Problem: The paper addresses the challenge of visual generation models struggling with complex prompts involving multiple objects, precise spatial relationships, and attributes, which requires explicit reasoning about semantic content and spatial layout.
4. 🛠️ Methods: The authors propose GoT-R1, a framework that applies reinforcement learning with a dual-stage multi-dimensional reward system leveraging MLLMs to evaluate both reasoning process and final output across semantic alignment, spatial accuracy, and visual quality.
5. 📊 Results and Evaluation: The framework demonstrated significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving spatial relationships and attribute binding, with their GoT-R1-7B model achieving superior performance across multiple evaluation metrics.
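The reward design above pairs naturally with Group Relative Policy Optimization, which scores each sampled generation against its own group rather than a learned value baseline. Below is a minimal illustrative sketch of that idea — the weights, function names, and reward values are assumptions for demonstration, not the authors' code.

```python
# Hypothetical sketch of GRPO-style advantages over a multi-dimensional
# reward (semantic alignment, spatial accuracy, visual quality), in the
# spirit of GoT-R1. Names and weights are illustrative assumptions.
import statistics

def combined_reward(semantic: float, spatial: float, visual: float,
                    weights=(1.0, 1.0, 1.0)) -> float:
    """Aggregate the per-dimension rewards into a single scalar."""
    return sum(w * r for w, r in zip(weights, (semantic, spatial, visual)))

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each sample's reward by the group mean and std,
    so no separate critic/value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four sampled generations for one prompt.
rewards = [combined_reward(*r) for r in
           [(0.9, 0.8, 0.7), (0.5, 0.4, 0.6), (0.7, 0.9, 0.8), (0.2, 0.3, 0.4)]]
advantages = group_relative_advantages(rewards)
```

By construction the advantages are zero-mean within the group, so above-average generations (here the first and third) get positive learning signal and below-average ones get negative signal.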

[Figure: GoT-R1 pipeline — text prompt → Generation Chain-of-Thought (GoT) semantic-spatial reasoning → rewards (prompt-reasoning semantic reward, layout-evaluation spatial reward, reasoning-image reward) → Group Relative Policy Optimization → generated image]
Q1
1. What is the main limitation of the previous Generation Chain-of-Thought (GoT) approach that GoT-R1 aims to overcome?
GoT could only generate black and white images
GoT relied on predefined templates which limited its reasoning abilities
GoT was too computationally expensive to run
Q2
2. How does GoT-R1's reward system evaluate the generation process?
It only evaluates the final image quality
It uses human raters to score each generation
It uses a dual-stage system evaluating both reasoning process and final output
Q3
3. Why does GoT-R1 convert coordinate data into visual bounding boxes when evaluating spatial relationships?
Because MLLMs show better spatial understanding with visual data than text coordinates
To reduce the computational cost of evaluation
To make the output more visually appealing to humans

Paper 2

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Published: 2025-05-22

Link: http://arxiv.org/pdf/2505.16933

1. 📘 Topic and Domain: The paper introduces LLaDA-V, a diffusion-based multimodal large language model for visual instruction tuning and image understanding.
2. 💡 Previous Research and New Ideas: Based on prior work in large language diffusion models (LLaDA) and visual instruction tuning, it proposes a novel purely diffusion-based approach rather than the dominant autoregressive paradigm.
3. ❓ Problem: The paper aims to develop an effective diffusion-based alternative to autoregressive multimodal language models for visual instruction tuning and image understanding tasks.
4. 🛠️ Methods: The approach combines a language diffusion model (LLaDA) with a vision encoder (SigLIP2) and MLP connector, using masked diffusion for training and a multi-stage training strategy including language-image alignment, visual instruction tuning, and multimodal reasoning enhancement.
5. 📊 Results and Evaluation: LLaDA-V achieved state-of-the-art performance among diffusion-based models and demonstrated superior data scalability compared to the LLaMA3-V baseline, though it slightly underperformed top autoregressive models such as Qwen2-VL.
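The masked-diffusion training described above can be sketched in a few lines: sample a mask ratio t, mask response tokens independently with probability t, and supervise the model only on the masked positions. This is an illustrative toy sketch under those assumptions, not the authors' implementation (the mask-token id and function names are made up here).

```python
# Minimal illustrative sketch of the masked-diffusion training step used by
# LLaDA-style models. MASK_ID and mask_response are assumptions for the demo.
import random

MASK_ID = -1  # placeholder mask-token id

def mask_response(tokens: list[int], t: float, rng: random.Random):
    """Mask each response token independently with probability t.
    The model is trained to recover only the masked positions;
    prompt/image tokens stay visible to the attention."""
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            masked.append(MASK_ID)
            targets.append(tok)   # supervised: recover the original token
        else:
            masked.append(tok)
            targets.append(None)  # unmasked positions carry no loss
    return masked, targets

rng = random.Random(0)
t = rng.random()                  # mask ratio t ~ Uniform(0, 1)
masked, targets = mask_response([11, 12, 13, 14, 15], t, rng)
```

At t close to 1 almost everything is masked (akin to generating from scratch); at t close to 0 the task approaches plain infilling, and averaging over t gives the diffusion-style training objective.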

[Figure: LLaDA-V three-stage training — Stage 1: language-image alignment (train MLP projector on LLaVA-Pretrain); Stage 2: visual instruction tuning (single-image training, 10M samples; OneVision training, 2M samples); Stage 3: multimodal reasoning enhancement (reasoning training, 900K samples; balanced reasoning training on a mixed dataset) → LLaDA-V model]
Q1
1. What is the main innovation of LLaDA-V compared to existing multimodal language models?
It uses a purely diffusion-based approach instead of autoregressive
It has a larger model size than previous models
It can process higher resolution images
Q2
2. Which training stage is unique to LLaDA-V's three-stage training strategy?
Language-image alignment stage
Visual instruction tuning stage
Multimodal reasoning enhancement stage
Q3
3. How did LLaDA-V perform compared to other models?
It outperformed all existing multimodal models
It achieved state-of-the-art among diffusion models but fell short of top autoregressive models
It performed worse than all baseline models

Paper 3

Risk-Averse Reinforcement Learning with Itakura-Saito Loss

Published: 2025-05-22

Link: http://arxiv.org/pdf/2505.16925

1. 📘 Topic and Domain: Risk-averse reinforcement learning using Itakura-Saito loss function for value function approximation in high-stakes applications like finance and healthcare.
2. 💡 Previous Research and New Ideas: Builds on exponential-utility-based risk-averse RL methods, proposing a novel Itakura-Saito divergence-based loss function to overcome the numerical instabilities of existing approaches.
3. ❓ Problem: Existing exponential-utility RL approaches suffer from numerical instabilities due to exponentiation of value functions at each step, preventing reliable convergence.
4. 🛠️ Methods: Introduced a new loss function based on Itakura-Saito divergence to learn state-value and action-value functions, providing theoretical guarantees and scale invariance.
5. 📊 Results and Evaluation: The proposed IS loss outperformed existing alternatives across multiple scenarios including analytically tractable portfolio examples, deep hedging tasks, and robust combinatorial RL problems, showing better numerical stability and convergence.
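The two properties claimed above — scale invariance and stability for small errors — follow directly from the standard Itakura-Saito divergence, d_IS(target, pred) = target/pred − log(target/pred) − 1. A short numerical check (an illustrative formula, not the authors' exact training loss):

```python
# Sketch of the Itakura-Saito divergence as a loss, checking two properties:
# (1) for small discrepancies it behaves like half a squared relative error,
# (2) it is invariant to rescaling both arguments by the same factor.
# Illustrative only; the paper's value-function loss builds on this form.
import math

def itakura_saito(pred: float, target: float) -> float:
    """d_IS(target, pred) = r - log(r) - 1 with r = target/pred (both > 0)."""
    r = target / pred
    return r - math.log(r) - 1.0

# Taylor-expanding around r = 1 (write r = 1 + eps):
#   d_IS = (1 + eps) - log(1 + eps) - 1 ≈ eps^2 / 2,
# i.e. a scaled squared error for small discrepancies — it never blows up
# the way exponentiated value targets can.
eps = 1e-3
is_loss = itakura_saito(1.0, 1.0 + eps)
mse_like = eps**2 / 2
```

The scale-invariance check is immediate: d_IS depends only on the ratio target/pred, so multiplying both by any positive constant leaves the loss unchanged — the property that makes the loss robust to the scale of the value function.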

[Figure: method overview — MDP setup (S, A, r, p, s₀); risk aversion via exponential utility Ẽα[X]; value functions Ṽπ(s) and Q̃π(s, a) learned with the numerically stable, scale-invariant Itakura-Saito loss; applications: portfolio optimization, deep hedging, robust RL]
Q1
1. What is the main advantage of the proposed Itakura-Saito loss compared to existing exponential-utility approaches?
It requires less computational resources
It provides better numerical stability and scale invariance
It works with any type of utility function
Q2
2. In which experimental scenario did the authors NOT test their proposed method?
Robot navigation tasks
Portfolio optimization
Deep hedging problems
Q3
3. What happens to the Itakura-Saito (IS) loss when dealing with small discrepancies between the V-function and its target?
It becomes equivalent to Mean Squared Error loss
It explodes exponentially
It approaches zero regardless of the error