1. 📘 Topic and Domain: The paper presents LingBot-World, an open-source world model for interactive video generation that bridges video synthesis and actionable simulation in computer vision and machine learning.
2. 💡 Previous Research and New Ideas: Building on video generation models and world simulators such as Genie 3 and Wan2.2, the paper proposes a multi-stage evolution strategy (pre-training, middle-training, post-training) that combines hierarchical data captioning with a mixture-of-experts architecture to achieve long-term consistency.
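The staged schedule above can be sketched as a simple ordered curriculum. This is a minimal illustration only: the stage names come from the summary, while the data mixes, step counts, and learning rates are hypothetical placeholders, not values from the paper.

```python
# Sketch of a three-stage "evolution" schedule (pre-training -> middle-training
# -> post-training). Stage names follow the summary; data mixes, step counts,
# and learning rates below are illustrative assumptions.

STAGES = [
    {"name": "pre-training",    "data": "web video + captions",            "steps": 3, "lr": 1e-4},
    {"name": "middle-training", "data": "game/synthetic interactive clips", "steps": 2, "lr": 5e-5},
    {"name": "post-training",   "data": "action-labeled trajectories",      "steps": 1, "lr": 1e-5},
]

def run_schedule(train_step):
    """Run each stage in order, handing its config to a user-supplied step fn."""
    log = []
    for stage in STAGES:
        for step in range(stage["steps"]):
            train_step(stage, step)          # e.g. one optimizer update
            log.append((stage["name"], step))
    return log

log = run_schedule(lambda stage, step: None)
print(len(log))  # 6 steps total across the three stages
```

The key design point such a curriculum captures is that later stages reuse the same model but shift the data distribution toward interactive, action-conditioned clips.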
3. ❓ Problem: The paper addresses the challenge of transitioning from passive video generation to interactive world simulation, tackling issues of scarce interactive data, maintaining long-term temporal coherence, and achieving real-time controllable generation.
4. 🛠️ Methods: The authors employ a scalable data engine for acquiring game and synthetic data, progressive curriculum training with a mixture-of-experts (MoE) architecture, and causal architecture adaptation with few-step distillation to enable real-time inference.
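Of the components above, MoE routing is the most self-contained to illustrate. The sketch below shows generic token-level top-k expert routing, the general technique named in the summary; the dimensions, expert count, and top-k value are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Generic top-k mixture-of-experts (MoE) layer: a router scores each token
# against every expert, and only the top-k experts process that token.
# All sizes here are illustrative assumptions.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x):
    """Route each token to its top-k experts and gate-mix their outputs."""
    logits = x @ router_w                          # (tokens, n_experts)
    k_idx = np.argsort(-logits, axis=1)[:, :top_k]  # indices of best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, k_idx[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                       # softmax over selected experts
        for g, e in zip(gates, k_idx[t]):
            out[t] += g * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((5, d_model))
y = moe_layer(tokens)
print(y.shape)  # (5, 8)
```

The appeal for video world models is that total parameter count scales with the number of experts while per-token compute stays bounded by top-k.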
5. 📊 Results and Evaluation: LingBot-World outperforms baselines on VBench metrics (0.8857 dynamic degree vs. 0.7612 and 0.7217), maintains minute-level temporal consistency, supports real-time interaction at 16 fps, and exhibits emergent spatial memory and 3D-consistency capabilities.