1. 📘 Topic and Domain: Visual Jigsaw is a self-supervised post-training framework for improving the visual understanding capabilities of multimodal large language models (MLLMs) across image, video, and 3D modalities.
2. 💡 Previous Research and New Ideas: Building on prior work in self-supervised visual learning and MLLM post-training, the paper proposes a novel jigsaw-style task that enhances visual perception without requiring any changes to the model architecture.
3. ❓ Problem: Current MLLM post-training approaches are predominantly text-centric and undervalue deep visual understanding, while existing visual enhancement methods require architectural changes or additional generative components.
4. 🛠️ Methods: Implements visual jigsaw tasks in which an input is partitioned into pieces and shuffled, and the model must state the correct order in natural language; the task is instantiated across three modalities: image patches, video clips, and 3D depth points.
5. 📊 Results and Evaluation: Achieves significant improvements across multiple benchmarks: enhanced fine-grained perception and spatial understanding in images, improved temporal reasoning in videos, and better 3D spatial comprehension, while preserving the model's original reasoning capabilities.