1. 📘 Topic and Domain: The paper explores reinforcement learning to enhance action prediction capabilities of GUI agents for interacting with graphical user interfaces.
2. 💡 Previous Research and New Ideas: Building on DeepSeek-R1's rule-based reinforcement learning approach, the paper applies it to multimodal large language models for GUI tasks, proposing a unified rule-based action reward function.
3. ❓ Problem: The paper addresses the limitations of supervised fine-tuning methods, which require large labeled datasets and generalize poorly to out-of-domain GUI tasks.
4. 🛠️ Methods: The authors employ rule-based reinforcement learning with a three-component reward function (action type, coordinate accuracy, format) and a carefully curated set of 136 high-quality training samples selected through a three-stage process.
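The three-component reward described above can be sketched as rule-based checks on the model's output. This is a hypothetical illustration, not the paper's implementation: the tag format, weights, and the point-in-box criterion for coordinate accuracy are assumptions.

```python
import re

def format_reward(output: str) -> float:
    # Assumed R1-style format check: reasoning in <think> tags followed
    # by the action in <answer> tags.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), re.DOTALL) else 0.0

def action_type_reward(pred_type: str, gold_type: str) -> float:
    # Exact match on the predicted action type (click, scroll, type, ...).
    return 1.0 if pred_type == gold_type else 0.0

def coordinate_reward(pred_xy, gold_box) -> float:
    # Assumed criterion: reward 1.0 if the predicted click point falls
    # inside the ground-truth bounding box (x1, y1, x2, y2).
    x, y = pred_xy
    x1, y1, x2, y2 = gold_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def total_reward(output, pred_type, gold_type, pred_xy=None, gold_box=None):
    # Unit weights on each component are an illustrative assumption.
    r = action_type_reward(pred_type, gold_type) + format_reward(output)
    if pred_type == "click" and pred_xy is not None and gold_box is not None:
        r += coordinate_reward(pred_xy, gold_box)
    return r
```

Because every component is a deterministic rule rather than a learned reward model, the signal is cheap to compute and hard to game, which is what makes training on so few samples feasible.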
5. 📊 Results and Evaluation: The model achieved significant improvements over the baseline, including 15% higher action type accuracy and 10.3% higher grounding accuracy on in-domain tasks, while remaining competitive with larger models on out-of-domain tasks despite using far less training data.