2025-05-21 Papers


Paper 1

VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

Published: 2025-05-20

Link: http://arxiv.org/pdf/2505.14460

1. 📘 Topic and Domain: No-reference image quality assessment (NR-IQA) using reinforcement learning and vision-language models.
2. 💡 Previous Research and New Ideas: Based on DeepSeek-R1's reasoning capabilities in language models and traditional IQA approaches, proposing a novel reinforcement learning to rank (RL2R) method that treats image quality as relative rather than absolute.
3. ❓ Problem: Addressing the limitations of current NR-IQA methods, particularly their poor generalization across different distortion types and need for perceptual scale realignment in multi-dataset training.
4. 🛠️ Methods: Employs group relative policy optimization (GRPO) to sample multiple quality scores per image, then uses the Thurstone model to turn each pair of score groups into a comparative probability, which is scored against the ground-truth ranking with a continuous fidelity reward for reinforcement learning.
5. 📊 Results and Evaluation: VisualQuality-R1 outperformed existing methods across eight datasets, achieving higher SRCC/PLCC scores (0.791/0.831) while generating human-aligned quality descriptions without requiring perceptual scale realignment.
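The Thurstone-plus-fidelity step in the methods above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names are mine, and it assumes a Thurstone-style comparison in which the K sampled scores for each image are treated as a Gaussian, combined with the standard fidelity measure between predicted and ground-truth comparison probabilities.

```python
import math
from statistics import mean, pstdev

def thurstone_prob(scores_x, scores_y):
    """Probability that image x beats image y in quality, treating each
    image's K sampled scores as a Gaussian (Thurstone-style comparison)."""
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    var = pstdev(scores_x) ** 2 + pstdev(scores_y) ** 2
    if var == 0:  # degenerate case: no score variation within either group
        return 0.5 if mu_x == mu_y else float(mu_x > mu_y)
    z = (mu_x - mu_y) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

def fidelity_reward(p_pred, p_true):
    """Continuous fidelity measure between a predicted and a ground-truth
    comparison probability; equals 1.0 when the two agree exactly."""
    return math.sqrt(p_pred * p_true) + math.sqrt((1 - p_pred) * (1 - p_true))
```

Because the fidelity reward is continuous in the predicted probability, it gives the policy gradient a smoother learning signal than a binary correct/incorrect ranking reward would.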


[Figure: VisualQuality-R1 workflow: input image pair → GRPO samples K quality scores per image → Thurstone model → comparative probabilities → fidelity measure reward → quality assessment output]
Q1. What is the key innovation in how VisualQuality-R1 approaches image quality assessment compared to traditional methods?
- It treats image quality as an absolute measurement using regression
- It treats image quality as a relative measurement using reinforcement learning to rank
- It only focuses on generating textual descriptions of image quality

Q2. What advantage does VisualQuality-R1 have when training on multiple datasets?
- It requires no perceptual scale realignment between datasets
- It can only be trained on one dataset at a time
- It needs manual calibration for each new dataset

Q3. What happens to the prediction variability of VisualQuality-R1 during training?
- It remains constant throughout training
- It increases as training progresses
- It steadily decreases, showing more stable predictions

Paper 2

Visual Agentic Reinforcement Fine-Tuning

Published: 2025-05-20

Link: http://arxiv.org/pdf/2505.14246

1. 📘 Topic and Domain: The paper focuses on Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), a framework for training large vision-language models (LVLMs) to use external tools like web search and code execution for complex visual reasoning tasks.
2. 💡 Previous Research and New Ideas: Based on previous research in language-only agentic abilities and reinforcement learning, the paper proposes a novel approach to enable multimodal models to use tools through reinforcement fine-tuning with verifiable rewards, extending beyond text-only capabilities.
3. ❓ Problem: The paper addresses the lack of multimodal agentic capabilities in open-source LVLMs, specifically their inability to use external tools for complex visual reasoning tasks.
4. 🛠️ Methods: The authors developed Visual-ARFT using reinforcement learning with verifiable rewards, created the Multimodal Agentic Tool Bench (MAT) for evaluation, and designed specific rewards for both searching and coding tasks.
5. 📊 Results and Evaluation: Visual-ARFT achieved significant improvements over baselines, with +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, outperforming GPT-4o and showing strong generalization capabilities on existing multi-hop QA benchmarks.
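The "verifiable rewards" in the methods above can be illustrated with a small sketch of rule-based answer checking, of the kind the F1/EM metrics imply. This is not the paper's reward code; the normalization scheme and function names are my own assumptions.

```python
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().strip().split())

def exact_match_reward(prediction, reference):
    """EM-style verifiable reward: 1.0 iff the normalized answers match."""
    return float(normalize(prediction) == normalize(reference))

def f1_reward(prediction, reference):
    """Token-level F1 between predicted and reference answers, usable as a
    soft verifiable reward for search-style QA tasks."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because such rewards are computed by deterministic rules rather than a learned reward model, they are cheap to evaluate at scale and harder for the policy to reward-hack.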


[Figure: Visual-ARFT workflow: input image + question → policy model (with reference model); search branch: task decomposition → search engine query → information integration; coding branch: problem analysis → code generation → code execution; both branches → final answer]
Q1. What is the main innovation of Visual-ARFT compared to previous approaches?
- It introduces a new type of large language model architecture
- It enables multimodal models to use external tools through reinforcement learning
- It improves text-only processing capabilities of vision models

Q2. How does Visual-ARFT handle reward signals during training?
- It relies on human feedback for each model prediction
- It uses a learned reward model to evaluate outputs
- It employs verifiable rewards based on objective correctness checks

Q3. What was the most significant performance improvement achieved by Visual-ARFT?
- 29.3% F1 score improvement on multi-hop QA benchmarks
- 18.6% F1 score improvement on MAT-Coding
- 10.3% F1 score improvement on MAT-Search

Paper 3

Latent Flow Transformer

Published: 2025-05-20

Link: http://arxiv.org/pdf/2505.14513

1. 📘 Topic and Domain: Development of a more efficient transformer architecture called Latent Flow Transformer (LFT) in the domain of large language models and deep learning.
2. 💡 Previous Research and New Ideas: Based on flow matching and diffusion models from image generation, proposing to replace multiple transformer layers with a single learned transport operator.
3. ❓ Problem: Addressing the inefficiency of traditional transformers that use many discrete layers, which leads to high computational and memory demands.
4. 🛠️ Methods: Introduces the Flow Walking (FW) training algorithm and the Recoupling Ratio layer-selection metric, used to replace a block of transformer layers with a single flow-based layer while preserving model performance.
5. 📊 Results and Evaluation: On the Pythia-410M model, LFT compressed 12 of the 24 layers into a single flow layer while achieving better performance than skipping 3 layers (KL divergence of 0.736 vs. 0.932), demonstrating significant parameter reduction with minimal performance loss.
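The inference-time role of the flow layer, and the KL metric quoted above, can be sketched numerically. This is a minimal sketch under my own assumptions (plain Euler integration and a toy velocity function); the paper's learned velocity network and the Flow Walking training procedure are not reproduced here.

```python
import numpy as np

def flow_transport(x, velocity, k_steps=4):
    """Transport hidden states x through a velocity field by k Euler steps
    over t in [0, 1]: the job done at inference by the single flow layer
    that replaces a block of transformer layers."""
    dt = 1.0 / k_steps
    for i in range(k_steps):
        x = x + dt * velocity(x, i * dt)
    return x

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions, the kind of metric used to
    compare the compressed model's predictions against the original's."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum(p * np.log(p / q)))
```

With the toy velocity v(x, t) = -x, four Euler steps shrink x by a factor of (1 - dt)^4; increasing k_steps gives a finer integration of the same flow, which is the knob the k-step inference scheme exposes.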


[Figure: Latent Flow Transformer workflow: original transformer (multiple layers) → layer selection via the Recoupling Ratio → compressed model with a single flow layer; training via standard flow matching (SFM) or Flow Walking (FW); inference via k-step integration; metrics: KL divergence between predicted and original hidden states, NMSE (normalized mean squared error)]
Q1. What is the main innovation of the Latent Flow Transformer (LFT) compared to traditional transformers?
- It uses more layers than traditional transformers
- It replaces multiple transformer layers with a single learned transport operator
- It completely eliminates the need for attention mechanisms

Q2. According to the paper's analysis using the Recoupling Ratio, which transformer layers are most amenable to compression?
- The first few layers
- The middle layers
- The final layers

Q3. When testing on the Pythia-410M model, what was the most significant compression achieved while maintaining better performance than the baseline?
- Compressed 6 layers into one
- Compressed 12 layers into one
- Compressed 18 layers into one