1. 📘 Topic and Domain: The paper presents MiMo-VL, a compact vision-language model focused on visual understanding, multimodal reasoning, and GUI interaction.
2. 💡 Previous Research and New Ideas: Building on prior vision-language models and RLHF research, it introduces Mixed On-policy Reinforcement Learning (MORL), which blends verifiable and human-preference reward signals, and incorporates high-quality reasoning data into the pre-training stages (a reward-routing sketch follows this list).
3. ❓ Problem: The paper aims to build a compact yet powerful vision-language model that handles complex visual understanding, multimodal reasoning, and GUI interaction tasks without sacrificing performance on general capabilities.
4. 🛠️ Methods: Uses a four-stage pre-training process (2.4 trillion tokens) followed by MORL post-training with diverse reward signals, built on a native-resolution Vision Transformer architecture that preserves fine-grained visual detail (a patchification sketch follows this list).
5. 📊 Results and Evaluation: MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35/40 tasks, scores 59.4 on OlympiadBench, achieves 56.1 on OSWorld-G, and shows strong performance across 50+ evaluation benchmarks, setting new standards for open-source vision-language models.
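
To make MORL's mix of reward signals concrete, here is a minimal Python sketch of per-task reward routing: verifiable tasks (e.g., math, grounding) get rule-based rewards while open-ended responses are scored by a preference model. The function names, task labels, and dummy scorers are illustrative assumptions, not the paper's actual implementation.

```python
import random

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the final answer matches, else 0.0.
    (Toy matcher; a real verifier would parse and normalize answers.)"""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

def preference_reward(response: str) -> float:
    """Stand-in for a learned human-preference reward model."""
    return random.uniform(0.0, 1.0)  # dummy score for illustration

def mixed_reward(response: str, sample: dict) -> float:
    """Route each sample to the reward suited to its task type, so
    verifiable and open-ended objectives are optimized together in
    a single on-policy RL loop."""
    if sample["task"] in {"math", "grounding", "counting"}:
        return verifiable_reward(response, sample["answer"])
    return preference_reward(response)

# Toy on-policy step: score rollouts from the current policy with
# their task-appropriate rewards.
batch = [
    {"prompt": "2+3=?", "task": "math", "answer": "5"},
    {"prompt": "Describe the image.", "task": "open_ended", "answer": None},
]
for sample in batch:
    rollout = "5" if sample["task"] == "math" else "A scenic photo."
    print(sample["task"], mixed_reward(rollout, sample))
```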
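A native-resolution Vision Transformer avoids resizing every image to a fixed square; instead it cuts each image at its original size into a variable-length sequence of fixed-size patches. The sketch below illustrates that idea only; the patch size of 14 and the padding scheme are assumptions for illustration, not details from the paper.

```python
import torch

PATCH = 14  # common ViT patch size; an assumption, not from the paper

def patchify_native(image: torch.Tensor, patch: int = PATCH) -> torch.Tensor:
    """Split a (C, H, W) image into a variable-length sequence of
    flattened patches, padding H and W up to a multiple of `patch`
    so the image is never resized to a fixed square."""
    c, h, w = image.shape
    pad_h = (-h) % patch
    pad_w = (-w) % patch
    image = torch.nn.functional.pad(image, (0, pad_w, 0, pad_h))
    # (C, H', W') -> (num_patches, C * patch * patch)
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

# Images of different sizes yield different sequence lengths,
# preserving detail in large images without distortion.
for h, w in [(224, 224), (448, 336)]:
    seq = patchify_native(torch.randn(3, h, w))
    print((h, w), "->", tuple(seq.shape))
```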