2025-10-01 Papers


Paper 1

Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Published: 2025-09-29

Link: http://arxiv.org/pdf/2509.25541

1. 📘 Topic and Domain: Vision-language model (VLM) self-improvement through gamified self-play training, focusing on computer vision and machine learning.
2. 💡 Previous Research and New Ideas: Building on reinforcement learning and self-play approaches such as AlphaGo, the paper introduces Vision-Zero, a framework that enables VLMs to improve through competitive visual games without human annotation.
3. ❓ Problem: Addresses the high cost and scalability limitations of current VLM training methods that rely heavily on human-curated datasets and annotations.
4. 🛠️ Methods: Implements a "Who Is the Spy" game framework where models engage in strategic reasoning across multiple roles, combined with Iterative Self-Play Policy Optimization (Iterative-SPO) that alternates between self-play and reinforcement learning.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing models trained on human-annotated datasets while significantly reducing training costs.
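The alternation described in item 4 is gated on smoothed decision-stage statistics. Below is a minimal sketch of one plausible gating rule, using the hysteresis thresholds reported in the paper's figure (τ_acc = 0.9, τ_err = 0.4, τ_na = 0.5); the EMA coefficient and the exact switching condition are assumptions, not the paper's specification:

```python
def ema(prev, x, alpha=0.1):
    """Exponential moving average used to smooth the gating statistics."""
    return alpha * x + (1 - alpha) * prev

def next_stage(stage, acc, err, na, tau_acc=0.9, tau_err=0.4, tau_na=0.5):
    """Hysteresis switching between self-play and RLVR training.

    Assumed rule: train the decision stage (RLVR) while voting degrades
    (high error or abstention rate), and return to clue-stage self-play
    once the smoothed decision accuracy clears tau_acc.
    """
    if stage == "self_play" and (err > tau_err or na > tau_na):
        return "rlvr"
    if stage == "rlvr" and acc > tau_acc:
        return "self_play"
    return stage
```

Hysteresis (separate thresholds for switching in each direction) prevents the trainer from oscillating between stages on noisy batch statistics.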

Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Key points from the paper's overview figure:

• Label-free data input: CLEVR synthetic scenes, chart data (ChartQA), and real-world images; the game environment is domain-agnostic.
• "Who Is the Spy?" game: a Clue Stage (strategic reasoning) and a Decision Stage (spy identification), with multi-role interaction.
• Iterative-SPO algorithm: alternates self-play training on the Clue Stage with RLVR on the Decision Stage for sustainable improvement.
• Clue Stage (self-play) zero-sum rewards: spy r_s = -β(v_s - v̄_c); civilian r_cj = (β/n_c)(v_s - v̄_c) - λ(v_cj - v̄_c); Role Advantage Estimation (RAE) with a KL-regularized policy gradient.
• Decision Stage (RLVR) discrete reward: +1 for a correct vote, -0.5 for uncertain (N/A), -1 for a wrong vote; group normalization with GRPO.
• Stage switching: hysteresis thresholds (τ_acc = 0.9, τ_err = 0.4, τ_na = 0.5) applied to exponential moving averages.
• Training loop: 1. generate image pairs (I_c, I_s) → 2. multi-agent gameplay → 3. collect votes and clues → 4. compute rewards → 5. update policies → 6. switch stages based on performance → 7. iterate for sustained improvement.
• Key results: SOTA on reasoning tasks, superior chart QA performance, mitigates negative transfer, cost-efficient training.
• Key advantages: zero human annotation, domain-agnostic inputs, scalable self-improvement, multi-capability enhancement.
• Performance gains: MathVision +3%, ChartQA +1.1%, LogicVista +2.9%; win rate 50% → 71%.
• Claimed as the first zero-human-in-the-loop VLM training paradigm with sustainable performance gains.
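The clue-stage rewards form a zero-sum game between the spy and the civilians. A minimal sketch of the reward split, using the formulas from the paper's figure; the β and λ values are placeholders, and v denotes the fraction of votes each player receives:

```python
def clue_stage_rewards(v_s, v_c, beta=1.0, lam=0.5):
    """Zero-sum clue-stage rewards split between the spy and n_c civilians.

    Spy:      r_s  = -beta * (v_s - mean(v_c))
    Civilian: r_cj = (beta / n_c) * (v_s - mean(v_c)) - lam * (v_cj - mean(v_c))
    """
    n_c = len(v_c)
    v_bar = sum(v_c) / n_c                      # mean civilian vote share
    r_spy = -beta * (v_s - v_bar)               # spy is penalized for drawing votes
    r_civ = [(beta / n_c) * (v_s - v_bar) - lam * (v_cj - v_bar) for v_cj in v_c]
    return r_spy, r_civ
```

Note that the civilian deviation terms sum to zero, so the spy's penalty is exactly offset by the civilians' collective reward regardless of λ, making the game zero-sum by construction.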
Q1
1. What is the main innovation of Vision-Zero's training approach compared to traditional VLM training methods?
It uses pre-existing human-annotated datasets more efficiently
It relies on competitive gameplay between models without human annotation
It combines multiple VLMs to cross-validate each other's outputs
Q2
2. In the 'Who Is the Spy' game framework of Vision-Zero, what happens during the Clue Stage?
Models directly identify the spy among players
Players vote on who they think is lying
Players provide verbal descriptions of their images while trying to avoid suspicion
Q3
3. Why does Vision-Zero alternate between Self-Play and RLVR in its Iterative-SPO algorithm?
To reduce computational costs during training
To prevent the model from stagnating in strategic equilibrium and ensure continuous improvement
To validate the model's performance against human benchmarks

Paper 2

OceanGym: A Benchmark Environment for Underwater Embodied Agents

Published: 2025-09-30

Link: http://arxiv.org/pdf/2509.26536

1. 📘 Topic and Domain: Embodied AI; OceanGym is a benchmark environment for testing and evaluating agents in simulated underwater settings.
2. 💡 Previous Research and New Ideas: Based on prior work in embodied AI and simulation environments for ground/aerial domains, this paper introduces the first comprehensive benchmark specifically for underwater scenarios.
3. ❓ Problem: The paper addresses the lack of standardized testing environments for underwater AI agents, which face unique challenges like low visibility, dynamic currents, and complex perception requirements.
4. 🛠️ Methods: The authors created a simulated underwater environment with 8 task domains, using Multi-modal Large Language Models (MLLMs) as agents that integrate perception, memory, and decision-making capabilities.
5. 📊 Results and Evaluation: Results showed significant performance gaps between MLLMs and human experts, with MLLMs struggling particularly in low-visibility conditions (14.8% success rate) and having difficulties with sonar data interpretation, object distinction, and consistent decision-making over extended missions.

OceanGym: A Benchmark Environment for Underwater Embodied Agents

Key points from the paper's workflow figure:

• Environment setup: Unreal Engine 5.3, an 800 m × 800 m ocean.
• Task categories: perception tasks and decision tasks.
• Agent framework: MLLM-driven and memory-augmented; evaluation uses distance-based scoring and accuracy metrics.
• Perception tasks: multi-view and context-based perception over RGB + sonar images from 6-direction sensors.
• Decision tasks: 8 underwater scenarios (search & inspection, navigation & docking) with continuous 3D control.
• Agent components: perception encoder, memory system, action decoder, language encoder.
• Data flow: input (RGB + sonar + instructions) → perception (MLLM processing) → memory (sliding window of K steps) → decision (action selection) → output (actions, responses).
• Environment conditions: shallow water (50 m) and deep water (500 m), high and low illumination, dynamic ocean currents, limited visibility.
• Evaluated models: GPT-4o-mini, Gemini-2.5, Qwen2.5-VL-7B, MiniCPM-V-4.5, compared against human performance.
• Key findings: large MLLM-human performance gap (14.8% success rate in deep water); MLLMs struggle with sonar interpretation relative to human experts; cross-task memory transfer improves performance in challenging conditions; extended exploration improves performance until it plateaus.
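The perceive → remember → decide data flow can be sketched as a simple agent loop. Everything here is illustrative: `env`, `mllm`, and the observation keys are hypothetical stand-ins, not OceanGym's actual interfaces:

```python
from collections import deque

def run_agent(env, mllm, instruction, k=8, max_steps=100):
    """Hypothetical OceanGym-style loop: perceive RGB + sonar, keep a
    sliding window of the last k steps, and ask an MLLM for each action."""
    memory = deque(maxlen=k)  # sliding-window memory over the last k steps
    obs = env.reset()
    for _ in range(max_steps):
        prompt = {"instruction": instruction,
                  "rgb": obs["rgb"], "sonar": obs["sonar"],
                  "history": list(memory)}
        action = mllm(prompt)          # MLLM maps multimodal state -> action
        obs, done = env.step(action)
        memory.append((action, obs.get("summary", "")))
        if done:
            break
    return memory
```

The bounded deque mirrors the figure's "sliding window K steps" memory: only the most recent k step summaries are fed back into the prompt, which keeps context length constant over long missions.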
Q1
1. What was the most significant challenge faced by MLLMs in underwater environments according to the paper's results?
Battery life limitations of underwater vehicles
Low visibility conditions leading to poor performance
Communication delays with surface control stations
Q2
2. Which unique feature of OceanGym sets it apart from other embodied AI benchmarks?
Its focus on aerial drone navigation
Its integration with real underwater vehicles
Its combination of both optical and sonar data processing
Q3
3. What was the performance gap between human experts and MLLMs in shallow water environments?
Humans achieved 100% while MLLMs averaged 18.4%
Humans achieved 80% while MLLMs averaged 50%
Humans achieved 90% while MLLMs averaged 30%

Paper 3

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Published: 2025-09-30

Link: http://arxiv.org/pdf/2509.25760

1. 📘 Topic and Domain: The paper focuses on developing truthful Large Language Models (LLMs) through reinforcement learning, addressing hallucination and uncertainty in natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous work in LLM fine-tuning and reinforcement learning, it proposes a novel ternary reward system that distinguishes between correct answers, hallucinations, and abstentions, unlike traditional binary reward approaches.
3. ❓ Problem: The paper aims to solve LLMs' tendency to hallucinate or provide incorrect information rather than admitting uncertainty when faced with questions beyond their knowledge.
4. 🛠️ Methods: The authors implement TruthRL using GRPO (Group Relative Policy Optimization) with a ternary reward system that rewards correct answers, penalizes hallucinations, and treats abstentions neutrally.
5. 📊 Results and Evaluation: Compared to vanilla RL, TruthRL reduced hallucinations by 28.9% and improved truthfulness by 21.1% across four knowledge-intensive benchmarks, demonstrating consistent gains across various backbone models in both retrieval and non-retrieval setups.
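TruthRL's optimizer, GRPO, scores each sampled response relative to the other responses in its group rather than with a learned value function. A sketch of the standard group-normalized advantage computation (the paper's exact normalization details may differ):

```python
import statistics

def group_advantages(rewards):
    """GRPO-style advantage estimation: sample a group of G responses for
    the same prompt, then normalize each response's reward by the group
    mean and standard deviation."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0:  # identical rewards carry no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]
```

With the ternary reward, a group mixing correct answers, abstentions, and hallucinations automatically yields positive advantages for the correct responses and negative ones for the hallucinations.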

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Key points from the paper's methodology figure:

• Problem: LLMs hallucinate instead of admitting uncertainty.
• Knowledge-boundary probing: sample 256 responses, identify out-of-knowledge (OOK) questions, and relabel them with "I don't know".
• Baselines: vanilla SFT, RFT (rejection sampling), R-Tuning, and vanilla RL with a binary reward.
• TruthRL framework: GRPO with a ternary reward; truthfulness scored as T = w₁·Acc + w₂·Unc − w₃·Hall, optimizing truthfulness directly rather than accuracy alone.
• Ternary reward design: +1 for correct answers, 0 for uncertain responses, −1 for hallucinations; encourages abstention over guessing.
• Enhanced variants: knowledge-enhanced (+1 for abstention on OOK questions) and reasoning-enhanced (additional reasoning-quality evaluation signals).
• Training process: online RL with GRPO — group sampling of G responses, advantage estimation, policy optimization.
• Evaluation setup: datasets CRAG, NQ, HotpotQA, MuSiQue; models Llama3.1-8B and Qwen2.5-7B; settings with and without retrieval; metrics are truthfulness and hallucination.
• Key results: hallucination rate down 28.9%, truthfulness up 21.1%, better knowledge-boundary recognition.
• Analysis: robust to hallucination-baiting, more confident on correct answers, scalable across model sizes; the simple ternary reward beats more complex designs.
• Ablations and comparisons: binary vs. ternary rewards, online vs. offline RL, reward-design variants; against vanilla SFT/RL and knowledge-enhanced baselines; robustness tested across LLM judges, model scales, and hallucination-baiting questions.
• Future directions: reasoning-quality integration, multi-objective optimization, advanced reward designs.
• Core innovation: shifting from accuracy-driven to truthfulness-driven training, with an explicit reward distinction between abstention and hallucination that lets models recognize their knowledge boundaries.
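The ternary reward and the truthfulness score from the paper's figure can be written down directly. The weights w₁, w₂, w₃ default to (1, 0, 1) here purely for illustration; the paper's actual weighting may differ:

```python
def ternary_reward(response_type):
    """TruthRL's ternary reward: reward correct answers, penalize
    hallucinations, and treat abstentions neutrally."""
    return {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}[response_type]

def truthfulness(acc, unc, hall, w=(1.0, 0.0, 1.0)):
    """Truthfulness score from the figure: T = w1*Acc + w2*Unc - w3*Hall,
    over a model's rates of accurate, uncertain, and hallucinated answers."""
    w1, w2, w3 = w
    return w1 * acc + w2 * unc - w3 * hall
```

Unlike a binary correct/incorrect reward, the zero reward for abstention makes "I don't know" strictly better than a wrong guess, which is exactly the incentive the paper attributes to its hallucination reductions.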
Q1
1. What is the key innovation in TruthRL's reward system compared to traditional approaches?
It uses a binary reward system of correct/incorrect only
It uses a ternary reward system distinguishing between correct answers, hallucinations, and abstentions
It only rewards correct answers and ignores all other responses
Q2
2. What was the primary impact of implementing TruthRL on model performance?
It improved accuracy but increased hallucinations
It reduced accuracy but eliminated all hallucinations
It reduced hallucinations by 28.9% while improving truthfulness by 21.1%
Q3
3. When evaluating on difficult questions where almost no method provides correct answers, how did TruthRL perform?
It produced minimal hallucinations (15.5%) while generating uncertain responses for most cases (84.5%)
It achieved 100% accuracy on all difficult questions
It produced high hallucinations similar to other baseline models