1. 📘 Topic and Domain: The paper introduces BabyVision, a benchmark for evaluating fundamental visual reasoning abilities in multimodal large language models (MLLMs), focusing on basic visual skills that humans develop before language acquisition.
2. 💡 Previous Research and New Ideas: Drawing on developmental psychology research showing that humans acquire core visual skills before language, the paper proposes an evaluation approach focused on these pre-linguistic visual abilities, in contrast to existing benchmarks that test high-level semantic reasoning.
3. ❓ Problem: The paper addresses the gap between MLLMs' strong performance on knowledge-intensive tasks and their weakness on basic visual tasks that even young children solve effortlessly.
4. 🛠️ Methods: The authors created a benchmark with 388 questions across 22 subtypes in 4 categories (fine-grained discrimination, visual tracking, spatial perception, pattern recognition), evaluated leading MLLMs against human performance, and introduced BABYVISION-GEN for testing visual generation capabilities.
5. 📊 Results and Evaluation: The best-performing model (Gemini3-Pro-Preview) achieved only 49.7% accuracy versus 94.1% for humans, with consistent deficits across all four categories, revealing a significant gap in MLLMs' fundamental visual understanding abilities.
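The headline numbers above imply a human–model gap of roughly 44 percentage points. A minimal sketch of that arithmetic (the `accuracy` helper and the correct/total counts are illustrative, not taken from the paper):

```python
def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)

# Headline figures reported in the summary above.
model_acc = 49.7   # best MLLM (Gemini3-Pro-Preview)
human_acc = 94.1   # human performance

# Human–model gap in percentage points.
gap = round(human_acc - model_acc, 1)  # → 44.4

# Illustrative only: accuracy from hypothetical per-question counts.
print(accuracy(497, 1000))  # → 49.7
```

This is just a sanity check on the reported figures, not a reconstruction of the paper's evaluation pipeline.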