1. 📘 Topic and Domain: The paper presents V-Triune, a unified reinforcement learning system for vision-language models that combines both visual reasoning and perception tasks.
2. 💡 Previous Research and New Ideas: Prior research focused separately on either reasoning tasks (math, science) or perception tasks (detection, grounding); this paper proposes a unified approach that covers both via a three-component training system and a dynamic IoU reward mechanism.
3. ❓ Problem: The paper addresses the challenge of training vision-language models to perform both reasoning and perception tasks effectively within a single unified framework, as previous approaches treated these tasks in isolation.
4. 🛠️ Methods: The paper implements a three-component system: Sample-Level Data Formatting (unified task inputs), Verifier-Level Reward Computation (custom rewards per task type), and Source-Level Metric Monitoring (per-source diagnostics), along with a Dynamic IoU reward that adapts the IoU threshold over training for perception tasks.
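A minimal sketch of what such a dynamic IoU reward could look like. This is an illustrative reconstruction, not the paper's implementation: the threshold schedule values (0.85 / 0.95 / 0.99) and the piecewise step boundaries are hypothetical placeholders standing in for whatever curriculum the paper actually uses; the general idea is that the IoU bar starts loose and tightens as training progresses.

```python
def compute_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dynamic_iou_reward(pred_box, gt_box, step, total_steps):
    """Perception reward gated by a threshold that tightens over training.

    Schedule below is hypothetical: loose early, strict late.
    """
    frac = step / total_steps
    if frac < 0.3:
        thresh = 0.85
    elif frac < 0.6:
        thresh = 0.95
    else:
        thresh = 0.99
    iou = compute_iou(pred_box, gt_box)
    # Reward the IoU itself only when it clears the current bar.
    return iou if iou >= thresh else 0.0
```

Early in training, moderately accurate boxes still earn reward, giving the model a learning signal; later, only near-exact localizations are rewarded.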
5. 📊 Results and Evaluation: The resulting Orsta models achieved significant improvements on the MEGA-Bench Core benchmark, with gains ranging from +2.1% to +14.1% across the 7B and 32B model variants, while also showing strong performance on downstream benchmarks such as MMMU, MathVista, and COCO.