1. 📘 Topic and Domain: The paper addresses spatial reasoning in vision-language models (VLMs) by enabling them to interact with 3D reconstructed environments rather than relying solely on 2D image perception.
2. 💡 Previous Research and New Ideas: Building on prior "think with image" approaches that use 2D tools (zoom, crop, depth estimation), the paper introduces "think with space": allowing VLMs to actively manipulate 3D point clouds reconstructed from multi-view images, using camera poses as spatial anchors for coherent 3D exploration.
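The "camera poses as spatial anchors" idea amounts to re-expressing a reconstructed point cloud in a chosen camera's coordinate frame. A minimal sketch, assuming standard 4x4 camera-to-world extrinsics (the function name and interface here are illustrative, not the paper's API):

```python
import numpy as np

def world_to_camera(points_w, pose_c2w):
    """Transform world-frame points into a camera's local frame.

    pose_c2w is a 4x4 camera-to-world extrinsic; its inverse maps
    world coordinates into the camera frame, so each camera pose
    serves as a spatial anchor for viewpoint-dependent reasoning.
    """
    w2c = np.linalg.inv(pose_c2w)
    homo = np.hstack([points_w, np.ones((len(points_w), 1))])
    return (w2c @ homo.T).T[:, :3]

# Example: a camera translated +2 along world z (identity rotation).
pose = np.eye(4)
pose[2, 3] = 2.0
pts = np.array([[0.0, 0.0, 0.0]])   # the world origin
print(world_to_camera(pts, pose))   # origin lies at z = -2 in this camera frame
```

Anchoring every manipulation to an explicit pose like this is what keeps multi-step 3D exploration geometrically consistent across views.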
3. ❓ Problem: Current VLMs struggle with genuine 3D reasoning tasks (multi-view understanding, route planning) because they remain fundamentally 2D perceivers, unable to build consistent 3D representations needed for spatial intelligence.
4. 🛠️ Methods: Think3D uses a 3D manipulation toolkit (Pi3 for reconstruction, camera-based transformations, novel view rendering) enabling iterative observe→manipulate→reflect loops, plus Think3D-RL, which trains smaller models via reinforcement learning (GRPO) to learn effective viewpoint-selection strategies.
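The observe→manipulate→reflect loop can be sketched as follows. This is a hypothetical skeleton under stated assumptions: the stub functions stand in for the real toolkit (Pi3 reconstruction, a renderer, and a viewpoint policy, none of which are reproduced here), and `propose_next_pose` plays the role that Think3D-RL's learned policy would fill.

```python
import numpy as np

def reconstruct_point_cloud(images):
    """Stub: return an (N, 3) point cloud from multi-view images."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(1000, 3))

def render_view(cloud, pose):
    """Stub renderer: summarize the cloud from a 4x4 camera pose."""
    w2c = np.linalg.inv(pose)
    pts = (w2c[:3, :3] @ cloud.T).T + w2c[:3, 3]
    return {"pose": pose, "mean_depth": float(pts[:, 2].mean())}

def propose_next_pose(history):
    """Stub policy: orbit the scene; Think3D-RL learns this choice."""
    angle = 0.5 * len(history)
    pose = np.eye(4)
    pose[:3, :3] = [[np.cos(angle), 0.0, np.sin(angle)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(angle), 0.0, np.cos(angle)]]
    pose[2, 3] = 3.0
    return pose

def spatial_reasoning_loop(images, max_steps=4):
    cloud = reconstruct_point_cloud(images)    # observe: build 3D scene
    history = []
    for _ in range(max_steps):
        pose = propose_next_pose(history)      # manipulate: pick a viewpoint
        obs = render_view(cloud, pose)         # render the novel view
        history.append(obs)                    # reflect: accumulate evidence
    return history

views = spatial_reasoning_loop(images=[])
print(len(views))  # 4
```

The key design point the sketch illustrates is that each iteration conditions the next viewpoint choice on the accumulated observation history, which is exactly the decision Think3D-RL optimizes with GRPO.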
5. 📊 Results and Evaluation: On BLINK Multi-view and MindCube, Think3D achieves a +7.8% average gain for GPT-4.1/Gemini-2.5-Pro, and +4.7% on VSI-Bench; with RL training, the benefit smaller models derive from spatial exploration grows from +0.7% to +6.8%, demonstrating that learned exploration policies significantly enhance 3D reasoning capabilities.