2025-05-23 Papers


Paper 1

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Published: 2025-05-22

Link: http://arxiv.org/pdf/2505.17022

1. 📘 Topic and Domain: The paper focuses on enhancing visual generation models' reasoning capabilities through reinforcement learning, specifically in the domain of text-to-image generation and computer vision.
2. 💡 Previous Research and New Ideas: The work builds upon the Generation Chain-of-Thought (GoT) approach but introduces a novel reinforcement learning framework to discover reasoning strategies autonomously, rather than relying on predefined templates.
3. ❓ Problem: The paper addresses the challenge of visual generation models struggling with complex prompts involving multiple objects, precise spatial relationships, and attributes, which requires explicit reasoning about semantic content and spatial layout.
4. 🛠️ Methods: The authors propose GoT-R1, a framework that applies reinforcement learning with a dual-stage multi-dimensional reward system leveraging MLLMs to evaluate both reasoning process and final output across semantic alignment, spatial accuracy, and visual quality.
5. 📊 Results and Evaluation: The framework demonstrated significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving spatial relationships and attribute binding, with their GoT-R1-7B model achieving superior performance across multiple evaluation metrics.
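The reward design above pairs naturally with Group Relative Policy Optimization, which scores each sampled generation against its own group rather than a learned value baseline. Below is a minimal illustrative sketch of that idea — the weights, function names, and reward values are assumptions for demonstration, not the authors' code.

```python
# Hypothetical sketch of GRPO-style advantages over a multi-dimensional
# reward (semantic alignment, spatial accuracy, visual quality), in the
# spirit of GoT-R1. Names and weights are illustrative assumptions.
import statistics

def combined_reward(semantic: float, spatial: float, visual: float,
                    weights=(1.0, 1.0, 1.0)) -> float:
    """Aggregate the per-dimension rewards into a single scalar."""
    return sum(w * r for w, r in zip(weights, (semantic, spatial, visual)))

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each sample's reward by the group mean and std,
    so no separate critic/value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four sampled generations for one prompt.
rewards = [combined_reward(*r) for r in
           [(0.9, 0.8, 0.7), (0.5, 0.4, 0.6), (0.7, 0.9, 0.8), (0.2, 0.3, 0.4)]]
advantages = group_relative_advantages(rewards)
```

By construction the advantages are zero-mean within the group, so above-average generations (here the first and third) get positive learning signal and below-average ones get negative signal.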

[Figure: GoT-R1 pipeline — text prompt → Generation Chain-of-Thought (GoT) semantic-spatial reasoning → rewards (prompt-reasoning semantic reward, layout-evaluation spatial reward, reasoning-image reward) → Group Relative Policy Optimization → generated image]
Q1
1. What is the main limitation of the previous Generation Chain-of-Thought (GoT) approach that GoT-R1 aims to overcome?
GoT could only generate black and white images
GoT relied on predefined templates which limited its reasoning abilities
GoT was too computationally expensive to run
Q2
2. How does GoT-R1's reward system evaluate the generation process?
It only evaluates the final image quality
It uses human raters to score each generation
It uses a dual-stage system evaluating both reasoning process and final output
Q3
3. Why does GoT-R1 convert coordinate data into visual bounding boxes when evaluating spatial relationships?
Because MLLMs show better spatial understanding with visual data than text coordinates
To reduce the computational cost of evaluation
To make the output more visually appealing to humans

Paper 2

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Published: 2025-05-22

Link: http://arxiv.org/pdf/2505.16933

1. 📘 Topic and Domain: The paper introduces LLaDA-V, a diffusion-based multimodal large language model for visual instruction tuning and image understanding.
2. 💡 Previous Research and New Ideas: Based on prior work in large language diffusion models (LLaDA) and visual instruction tuning, it proposes a novel purely diffusion-based approach rather than the dominant autoregressive paradigm.
3. ❓ Problem: The paper aims to develop an effective diffusion-based alternative to autoregressive multimodal language models for visual instruction tuning and image understanding tasks.
4. 🛠️ Methods: The approach combines a language diffusion model (LLaDA) with a vision encoder (SigLIP2) and MLP connector, using masked diffusion for training and a multi-stage training strategy including language-image alignment, visual instruction tuning, and multimodal reasoning enhancement.
5. 📊 Results and Evaluation: LLaDA-V achieved state-of-the-art performance among diffusion-based models and demonstrated superior data scalability compared to the LLaMA3-V baseline, though it slightly underperformed top autoregressive models such as Qwen2-VL.
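The masked-diffusion training described above can be sketched in a few lines: sample a mask ratio t, mask response tokens independently with probability t, and supervise the model only on the masked positions. This is an illustrative toy sketch under those assumptions, not the authors' implementation (the mask-token id and function names are made up here).

```python
# Minimal illustrative sketch of the masked-diffusion training step used by
# LLaDA-style models. MASK_ID and mask_response are assumptions for the demo.
import random

MASK_ID = -1  # placeholder mask-token id

def mask_response(tokens: list[int], t: float, rng: random.Random):
    """Mask each response token independently with probability t.
    The model is trained to recover only the masked positions;
    prompt/image tokens stay visible to the attention."""
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            masked.append(MASK_ID)
            targets.append(tok)   # supervised: recover the original token
        else:
            masked.append(tok)
            targets.append(None)  # unmasked positions carry no loss
    return masked, targets

rng = random.Random(0)
t = rng.random()                  # mask ratio t ~ Uniform(0, 1)
masked, targets = mask_response([11, 12, 13, 14, 15], t, rng)
```

At t close to 1 almost everything is masked (akin to generating from scratch); at t close to 0 the task approaches plain infilling, and averaging over t gives the diffusion-style training objective.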

[Figure: LLaDA-V three-stage training — Stage 1: language-image alignment (train MLP projector on LLaVA-Pretrain); Stage 2: visual instruction tuning (single-image training, 10M samples; OneVision training, 2M samples); Stage 3: multimodal reasoning enhancement (reasoning training, 900K samples; balanced reasoning training on a mixed dataset) → LLaDA-V model]
Q1
1. What is the main innovation of LLaDA-V compared to existing multimodal language models?
It uses a purely diffusion-based approach instead of autoregressive
It has a larger model size than previous models
It can process higher resolution images
Q2
2. Which training stage is unique to LLaDA-V's three-stage training strategy?
Language-image alignment stage
Visual instruction tuning stage
Multimodal reasoning enhancement stage
Q3
3. How did LLaDA-V perform compared to other models?
It outperformed all existing multimodal models
It achieved state-of-the-art among diffusion models but fell short of top autoregressive models
It performed worse than all baseline models

Paper 3

Risk-Averse Reinforcement Learning with Itakura-Saito Loss

Published: 2025-05-22

Link: http://arxiv.org/pdf/2505.16925

1. 📘 Topic and Domain: Risk-averse reinforcement learning using Itakura-Saito loss function for value function approximation in high-stakes applications like finance and healthcare.
2. 💡 Previous Research and New Ideas: Builds on exponential-utility-based risk-averse RL methods, proposing a novel Itakura-Saito divergence-based loss function to overcome the numerical instabilities of existing approaches.
3. ❓ Problem: Existing exponential-utility RL approaches suffer from numerical instabilities due to exponentiation of value functions at each step, preventing reliable convergence.
4. 🛠️ Methods: Introduced a new loss function based on Itakura-Saito divergence to learn state-value and action-value functions, providing theoretical guarantees and scale invariance.
5. 📊 Results and Evaluation: The proposed IS loss outperformed existing alternatives across multiple scenarios including analytically tractable portfolio examples, deep hedging tasks, and robust combinatorial RL problems, showing better numerical stability and convergence.
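The two properties claimed above — scale invariance and stability for small errors — follow directly from the standard Itakura-Saito divergence, d_IS(target, pred) = target/pred − log(target/pred) − 1. A short numerical check (an illustrative formula, not the authors' exact training loss):

```python
# Sketch of the Itakura-Saito divergence as a loss, checking two properties:
# (1) for small discrepancies it behaves like half a squared relative error,
# (2) it is invariant to rescaling both arguments by the same factor.
# Illustrative only; the paper's value-function loss builds on this form.
import math

def itakura_saito(pred: float, target: float) -> float:
    """d_IS(target, pred) = r - log(r) - 1 with r = target/pred (both > 0)."""
    r = target / pred
    return r - math.log(r) - 1.0

# Taylor-expanding around r = 1 (write r = 1 + eps):
#   d_IS = (1 + eps) - log(1 + eps) - 1 ≈ eps^2 / 2,
# i.e. a scaled squared error for small discrepancies — it never blows up
# the way exponentiated value targets can.
eps = 1e-3
is_loss = itakura_saito(1.0, 1.0 + eps)
mse_like = eps**2 / 2
```

The scale-invariance check is immediate: d_IS depends only on the ratio target/pred, so multiplying both by any positive constant leaves the loss unchanged — the property that makes the loss robust to the scale of the value function.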

[Figure: method overview — MDP setup (S, A, r, p, s₀); risk aversion via exponential utility Ẽα[X]; value functions Ṽπ(s) and Q̃π(s, a) learned with the numerically stable, scale-invariant Itakura-Saito loss; applications: portfolio optimization, deep hedging, robust RL]
Q1
1. What is the main advantage of the proposed Itakura-Saito loss compared to existing exponential-utility approaches?
It requires less computational resources
It provides better numerical stability and scale invariance
It works with any type of utility function
Q2
2. In which experimental scenario did the authors NOT test their proposed method?
Robot navigation tasks
Portfolio optimization
Deep hedging problems
Q3
3. What happens to the Itakura-Saito (IS) loss when dealing with small discrepancies between the V-function and its target?
It becomes equivalent to Mean Squared Error loss
It explodes exponentially
It approaches zero regardless of the error