1. 📘 Topic and Domain: The paper addresses the generation of simulation-ready articulated 3D assets from monolithic static meshes using multimodal large language models (MLLMs), falling within the domain of 3D computer vision, embodied AI, and physics-based simulation.
2. 💡 Previous Research and New Ideas: The paper builds upon prior work in articulated object reconstruction (e.g., ArtGS, PartField), MLLM-based kinematic reasoning (e.g., Articulate-Anything, PhysX-Anything), and 3D tokenization. Its new ideas include a unified MLLM framework that jointly performs part decomposition and kinematic prediction, and a Sparse 3D VQ-VAE that reduces token counts by 70% to overcome memory limitations of dense voxel representations.
3. ❓ Problem: The paper addresses the scarcity of "sim-ready" articulated assets: most existing 3D meshes are static and not decomposed into parts, and existing multi-stage pipelines for creating articulated objects accumulate errors across stages and can produce part geometry that is inconsistent with the predicted joints.
4. 🛠️ Methods: The authors propose SIMART, which uses a Sparse 3D VQ-VAE for efficient geometric tokenization and a Qwen3-VL-based MLLM backbone to jointly perform part-level mesh decomposition and kinematic parameter (joint type, axis, limits) prediction, outputting URDF specifications and segmented meshes.
5. 📊 Results and Evaluation: SIMART achieves state-of-the-art performance on PartNet-Mobility and on SIMART-Bench, a newly curated benchmark of AI-generated assets. It outperforms baselines such as URDFormer, Articulate-Anything, and PhysX-Anything on joint type classification accuracy (Type↑), joint axis error (Axis↓), joint origin error (Origin↓), part IoU (IoU↑), and Chamfer distance (CD↓), and its outputs can be used in physics-based robotic simulation and VR/AR applications.
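
To make the evaluation metrics named above more concrete, the following is a minimal sketch of how axis error, part IoU, and Chamfer distance are commonly computed. It is not the paper's evaluation code: the function names (`axis_error_deg`, `part_iou`, `chamfer_distance`) are hypothetical, and the specific conventions (sign-invariant axis comparison, per-element boolean part masks, symmetric mean nearest-neighbor Chamfer distance) are assumptions that may differ from the definitions the authors use.

```python
import math


def axis_error_deg(pred_axis, gt_axis):
    """Angle in degrees between two joint axes.

    Assumption: the axis direction is treated as sign-invariant,
    so v and -v count as the same axis.
    """
    dot = sum(p * g for p, g in zip(pred_axis, gt_axis))
    norm_p = math.sqrt(sum(p * p for p in pred_axis))
    norm_g = math.sqrt(sum(g * g for g in gt_axis))
    cos = min(1.0, abs(dot) / (norm_p * norm_g))  # clamp for float safety
    return math.degrees(math.acos(cos))


def part_iou(pred_mask, gt_mask):
    """Intersection over union of two boolean part masks.

    Assumption: masks are per-element (e.g. per-face or per-voxel)
    membership flags of equal length.
    """
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    return 1.0 if union == 0 else inter / union


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Assumption: mean nearest-neighbor Euclidean distance, summed over
    both directions. Brute force O(N*M), fine for small illustrations.
    """
    def one_way(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


if __name__ == "__main__":
    print(axis_error_deg([0, 0, 1], [0, 0, -1]))  # flipped axis, same line
    print(part_iou([True, True, False, False], [True, False, False, False]))
    print(chamfer_distance([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 0)]))
```

Under these conventions, lower axis error and Chamfer distance and higher IoU indicate better agreement with ground truth, which matches the arrow directions in the list above.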