1. 📘 Topic and Domain: The paper focuses on evaluating and testing the visual perception capabilities of Multimodal Large Language Models (MLLMs) through a new benchmark called Turing Eye Test (TET).
2. 💡 Previous Research and New Ideas: Previous research focused on the reasoning capabilities of MLLMs; this paper instead shifts the focus to fundamental visual perception, probed through specialized perceptual tasks.
3. ❓ Problem: The paper asks whether MLLMs can truly perceive visual information as humans do, revealing a fundamental gap between machine and human perception.
4. 🛠️ Methods: The authors created four diagnostic tasks (HiddenText, 3DCaptcha, ColorBlind, and ChineseLigatures) and evaluated 15 state-of-the-art MLLMs using Pass@1 and Pass@K metrics, along with analyzing model behavior through Grad-CAM visualization.
5. 📊 Results and Evaluation: Current MLLMs failed catastrophically on these perceptual tasks, with most models achieving near-zero success rates; fine-tuning the vision tower enabled rapid adaptation, suggesting the limitation lies in visual perception rather than reasoning.
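For context on the evaluation metrics in point 4: a common formulation of Pass@K is the unbiased estimator popularized in code-generation benchmarks, which estimates the probability that at least one of k samples drawn from n attempts is correct. Note this is an assumption about the standard metric, not a confirmed description of the paper's exact protocol; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k
    samples drawn without replacement from n attempts is correct,
    given that c of the n attempts were correct."""
    if n - c < k:
        # Too few incorrect attempts to fill k samples: success guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Pass@1 reduces to the raw success rate c/n.
print(round(pass_at_k(10, 0, 1), 4))  # a model with zero correct attempts
print(round(pass_at_k(10, 3, 1), 4))  # three correct out of ten attempts
```

Pass@1 thus measures single-shot accuracy, while Pass@K with larger k tests whether a model can ever succeed given multiple tries; near-zero Pass@K is a much stronger failure signal than near-zero Pass@1.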