1. 📘 Topic and Domain: The paper focuses on evaluating Optical Character Recognition (OCR) capabilities of Multimodal Large Language Models (MLLMs) in video scenarios.
2. 💡 Previous Research and New Ideas: Previous research mainly focused on OCR in static images, while this paper introduces a comprehensive benchmark for video OCR tasks and proposes new evaluation methods for dynamic text recognition.
3. ❓ Problem: The paper addresses the challenge of evaluating MLLMs' ability to recognize, understand, and reason about text in videos, which is more complex than static image OCR due to motion blur, temporal variations, and visual effects.
4. 🛠️ Methods: The authors constructed the MME-VideoOCR benchmark, comprising 1,464 videos and 2,000 manually annotated question-answer pairs spanning 25 tasks in 10 categories. They evaluated 18 state-of-the-art MLLMs using three scoring protocols: containment match, GPT-assisted scoring, and multiple-choice evaluation.
5. 📊 Results and Evaluation: The best-performing model (Gemini-2.5 Pro) achieved 73.7% accuracy, while most models struggled with tasks requiring spatio-temporal reasoning and cross-frame information integration, highlighting the need for improved video OCR capabilities in MLLMs.
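To illustrate the containment-match protocol mentioned in the methods, here is a minimal sketch of how such a metric is commonly implemented: the prediction counts as correct if the normalized ground-truth answer appears as a substring of the normalized model output. The function name and normalization steps are illustrative assumptions, not the paper's actual implementation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return " ".join(text.split())        # collapse runs of whitespace


def containment_match(prediction: str, answer: str) -> bool:
    """Return True if the normalized answer is contained in the normalized prediction."""
    return normalize(answer) in normalize(prediction)


# Example usage (hypothetical OCR outputs):
print(containment_match("The sign reads: 'STOP AHEAD'.", "stop ahead"))  # True
print(containment_match("No legible text detected.", "exit"))            # False
```

A GPT-assisted scorer would typically replace this exact-substring check with a judge-model prompt for free-form answers, while multiple-choice evaluation compares the extracted option letter against the key.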