1. 📘 Topic and Domain: The paper focuses on developing an AI agent framework called Agent S2 for automating computer tasks through direct interaction with graphical user interfaces (GUIs) across operating systems and devices.
2. 💡 Previous Research and New Ideas: Building on prior monolithic and hierarchical approaches to computer-use agents, the paper introduces a compositional framework that pairs generalist planning modules with specialist grounding experts, together with two new techniques: Mixture-of-Grounding and Proactive Hierarchical Planning.
3. ❓ Problem: The paper addresses three core limitations of current computer-use agents: imprecise GUI element grounding, difficulty with long-horizon task planning, and performance bottlenecks from relying solely on single generalist models.
4. 🛠️ Methods: The framework combines a Manager (high-level planning), a Worker (low-level execution), and specialized grounding experts (visual, textual, structural), and uses proactive hierarchical planning to update plans dynamically as observations evolve.
5. 📊 Results and Evaluation: Agent S2 achieves state-of-the-art performance across multiple benchmarks: 18.9% and 32.7% relative improvements on OSWorld's 15-step and 50-step evaluations, a 52.8% improvement on WindowsAgentArena, and a 16.52% improvement on AndroidWorld over previous methods.
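
The control flow described under Methods can be pictured with a small toy sketch. This is not the paper's implementation: every name here (`Subgoal`, `manager_plan`, `worker_execute`, `GROUNDING_EXPERTS`, `run_agent`) is hypothetical. The "experts" are placeholders that return strings rather than resolving real GUI elements, and the "observation" is reduced to a list of completed targets. The sketch only illustrates two ideas from the summary: routing each action to a specialist grounding expert, and replanning after each subgoal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Subgoal:
    expert: str  # which grounding expert should resolve the target (assumed field)
    target: str  # natural-language description of the GUI element (assumed field)


# Placeholder experts: each returns a string standing in for whatever a real
# expert would resolve (for example, a screen location or a text span).
def visual_expert(target: str) -> str:
    return f"visual:{target}"


def textual_expert(target: str) -> str:
    return f"textual:{target}"


def structural_expert(target: str) -> str:
    return f"structural:{target}"


# Mixture-of-Grounding, reduced to a lookup table for illustration.
GROUNDING_EXPERTS: Dict[str, Callable[[str], str]] = {
    "visual": visual_expert,
    "textual": textual_expert,
    "structural": structural_expert,
}


def worker_execute(subgoal: Subgoal) -> str:
    """Toy Worker: route the subgoal to the matching grounding expert."""
    return GROUNDING_EXPERTS[subgoal.expert](subgoal.target)


def manager_plan(task: List[Subgoal], completed: List[str]) -> List[Subgoal]:
    """Toy Manager: return the subgoals that are not yet completed."""
    return [s for s in task if s.target not in completed]


def run_agent(task: List[Subgoal]) -> List[str]:
    """Execute one subgoal at a time, replanning from the latest observation."""
    completed: List[str] = []
    trace: List[str] = []
    plan = manager_plan(task, completed)
    while plan:
        subgoal = plan[0]
        trace.append(worker_execute(subgoal))
        completed.append(subgoal.target)
        # Proactive step: rebuild the remaining plan after every subgoal.
        plan = manager_plan(task, completed)
    return trace


task = [
    Subgoal("visual", "Save button"),
    Subgoal("textual", "second paragraph"),
    Subgoal("structural", "cell A1"),
]
trace = run_agent(task)
```

In a real agent the Manager and Worker would be model calls and the observation would be a screenshot or similar state; the lookup table stands in for whatever mechanism the Worker uses to choose an expert.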