1. 📘 Topic and Domain: OmniWorld, a large-scale, multi-domain, multi-modal dataset for 4D world modeling, within computer vision and machine learning.
2. 💡 Previous Research and New Ideas: Builds on existing datasets such as Sintel, KITTI, and RealEstate10K, which lack diversity and dynamic complexity; proposes a new, comprehensive dataset that combines synthetic game data with real-world footage across multiple domains.
3. ❓ Problem: Addresses the lack of high-quality, diverse data for training and evaluating 4D world modeling systems, particularly for tasks that require complex spatio-temporal understanding.
4. 🛠️ Methods: Builds OmniWorld by combining self-collected game footage (OmniWorld-Game) with curated public datasets, then annotates the combined data with depth maps, camera poses, text captions, optical flow, and foreground masks using specialized pipelines.
5. 📊 Results and Evaluation: Fine-tuning existing models on OmniWorld significantly improved their performance on tasks such as depth estimation and camera-controlled video generation, with quantitative gains reported on multiple benchmarks.
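
The annotation types listed in item 4 can be pictured as one record per clip. The following is a minimal sketch, not OmniWorld's actual file format: the class name `ClipAnnotations`, its field names, the path-based layout, and the helper `select_clips` are all hypothetical, and it assumes that some clips may be missing a modality.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ClipAnnotations:
    """Hypothetical per-clip record; field names are illustrative only."""
    clip_id: str
    domain: str                                  # e.g. "game" for OmniWorld-Game clips
    depth_path: Optional[str] = None             # depth maps
    camera_pose_path: Optional[str] = None       # camera poses
    caption: Optional[str] = None                # text caption
    optical_flow_path: Optional[str] = None      # optical flow
    foreground_mask_path: Optional[str] = None   # foreground masks


# The five annotation types named in the summary, as field names.
MODALITIES = (
    "depth_path",
    "camera_pose_path",
    "caption",
    "optical_flow_path",
    "foreground_mask_path",
)


def available_modalities(clip: ClipAnnotations) -> List[str]:
    """Return the names of the modalities that are present for this clip."""
    return [name for name in MODALITIES if getattr(clip, name) is not None]


def select_clips(
    clips: Sequence[ClipAnnotations], required: Sequence[str]
) -> List[ClipAnnotations]:
    """Keep only clips that carry every modality listed in `required`."""
    return [
        clip
        for clip in clips
        if all(getattr(clip, name) is not None for name in required)
    ]
```

Under these assumptions, a depth-estimation fine-tuning run would call `select_clips(clips, ["depth_path"])`, while a camera-controlled video generation run would require `["camera_pose_path", "caption"]`.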