1. 📘 Topic and Domain: The paper focuses on Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), a framework for training large vision-language models (LVLMs) to use external tools like web search and code execution for complex visual reasoning tasks.
2. 💡 Previous Research and New Ideas: Building on prior research into language-only agentic abilities and reinforcement learning, the paper proposes enabling multimodal models to use tools via reinforcement fine-tuning with verifiable rewards, extending agentic tool use beyond text-only settings.
3. ❓ Problem: The paper addresses the lack of multimodal agentic capabilities in open-source LVLMs, specifically their inability to use external tools for complex visual reasoning tasks.
4. 🛠️ Methods: The authors developed Visual-ARFT using reinforcement learning with verifiable rewards, created the Multimodal Agentic Tool Bench (MAT) for evaluation, and designed specific rewards for both searching and coding tasks.
5. 📊 Results and Evaluation: Visual-ARFT achieved significant gains over baselines, with +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, outperforming GPT-4o and generalizing well to existing multi-hop QA benchmarks.
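The verifiable rewards mentioned in item 4 can be illustrated with a minimal sketch. The snippet below combines a format check (whether the model's output follows a structured reasoning-then-answer template) with a token-level F1 accuracy reward, in the spirit of rule-based verifiable rewards used for agentic fine-tuning. The exact tag schema (`<think>`/`<answer>`), the 0.1 format weight, and the function names are illustrative assumptions, not taken from the paper:

```python
import re
from collections import Counter


def format_reward(response: str) -> float:
    """Return 1.0 if the response matches the assumed
    <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0


def f1_reward(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between predicted and reference answers,
    as in standard extractive-QA evaluation."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


def verifiable_reward(response: str, ground_truth: str, w_format: float = 0.1) -> float:
    """Weighted sum of format and accuracy rewards; the weighting is illustrative."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer_text = match.group(1).strip() if match else response
    return w_format * format_reward(response) + (1 - w_format) * f1_reward(
        answer_text, ground_truth
    )
```

Because both components are computed by deterministic rules rather than a learned reward model, they are "verifiable": any rollout can be scored cheaply and without reward hacking through a judge model.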