1. 📘 Topic and Domain: OceanGym is a benchmark for testing and evaluating embodied AI agents in simulated underwater environments.
2. 💡 Previous Research and New Ideas: Building on prior work in embodied AI and simulation environments for ground and aerial domains, this paper introduces the first comprehensive benchmark designed specifically for underwater scenarios.
3. ❓ Problem: The paper addresses the lack of standardized testing environments for underwater AI agents, which face unique challenges such as low visibility, dynamic currents, and complex perception requirements.
4. 🛠️ Methods: The authors created a simulated underwater environment with 8 task domains, using Multi-modal Large Language Models (MLLMs) as agents that integrate perception, memory, and decision-making capabilities (a minimal sketch of such an agent loop follows this list).
5. 📊 Results and Evaluation: Results showed a significant performance gap between MLLM agents and human experts: MLLMs struggled especially in low-visibility conditions (14.8% success rate) and had difficulty interpreting sonar data, distinguishing objects, and maintaining consistent decision-making over extended missions.
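
Since the summary only names the agent components, here is a minimal sketch of how a perception–memory–decision loop around an MLLM could be organized. This is an assumption for illustration only: the environment interface, `query_mllm`, and the `step` return signature are placeholders and are not taken from the OceanGym paper.

```python
# Hypothetical sketch of an MLLM agent loop for an underwater benchmark.
# The env interface, query_mllm, and step() signature are placeholders,
# not the actual OceanGym API.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Memory:
    """Rolling buffer of past observations and actions the agent conditions on."""
    entries: list = field(default_factory=list)
    max_len: int = 20

    def add(self, obs_summary: str, action: str) -> None:
        self.entries.append((obs_summary, action))
        self.entries = self.entries[-self.max_len:]

    def as_prompt(self) -> str:
        return "\n".join(f"obs: {o} -> action: {a}" for o, a in self.entries)


def query_mllm(camera_frames: list, sonar: Any, history: str) -> str:
    """Placeholder: a real agent would send the multi-modal observation and
    memory summary to an MLLM and parse its chosen action from the response."""
    return "hold_position"  # dummy action for the sketch


def run_episode(env, max_steps: int = 200) -> bool:
    """Perception -> memory -> decision loop until the episode ends."""
    memory = Memory()
    obs = env.reset()  # assumed dict with camera and sonar observations
    for _ in range(max_steps):
        action = query_mllm(obs["camera"], obs["sonar"], memory.as_prompt())
        obs, done, success = env.step(action)  # assumed return signature
        memory.add(obs.get("summary", ""), action)
        if done:
            return success
    return False
```

The bounded memory buffer is one plausible way to keep long missions within an MLLM's context window; it also illustrates why consistent decision-making over extended missions is hard, since older observations are eventually dropped.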