1. 📘 Topic and Domain: The paper focuses on holistic Optical Character Recognition (OCR) that unifies text-centric recognition (documents, formulas, tables) and vision-centric recognition (charts, web pages, scientific plots), spanning the computer vision and natural language processing domains.
2. 💡 Previous Research and New Ideas: The paper builds on existing text-centric OCR methods (both pipeline-based and VLM-based approaches) and vision-centric parsing techniques, proposing OCRVerse as the first end-to-end holistic OCR method that bridges character-level recognition and code-level representation within a unified framework.
3. ❓ Problem: The paper addresses the limitation that existing OCR methods focus primarily on extracting text from documents while neglecting visually information-dense sources (e.g., charts and web pages) that require code-level representations, leaving a gap in handling diverse real-world multimodal content.
4. 🛠️ Methods: The authors use comprehensive data engineering covering 15 data types and a two-stage training methodology combining supervised fine-tuning (SFT) and reinforcement learning (RL): the SFT stage establishes cross-domain knowledge through mixed-data training, and the RL stage applies personalized reward strategies for domain-specific optimization.
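The "personalized reward strategies" in the RL stage can be pictured as routing each rollout to a domain-appropriate scoring rule. The sketch below is a minimal illustration under assumed names and reward rules (text-centric domains scored at the character level, vision-centric domains scored on their code output); it is not the authors' actual implementation.

```python
# Illustrative sketch of domain-routed rewards for the RL stage.
# All function names, domain keys, and reward rules are assumptions.

def text_reward(pred: str, target: str) -> float:
    """Character-level reward for text-centric samples (documents, formulas, tables)."""
    # Simple position-wise similarity; a real system might use normalized edit distance.
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / max(len(pred), len(target), 1)

def code_reward(pred: str, target: str) -> float:
    """Code-level reward for vision-centric samples (charts, web pages, SVG).
    A real system might execute or render the predicted code; here we
    approximate with an exact-match placeholder."""
    return 1.0 if pred.strip() == target.strip() else 0.0

# Map each data domain to its reward function.
REWARD_ROUTER = {
    "document": text_reward,
    "formula": text_reward,
    "table": text_reward,
    "chart": code_reward,
    "webpage": code_reward,
    "svg": code_reward,
}

def compute_reward(domain: str, pred: str, target: str) -> float:
    """Score one RL rollout with the reward personalized to its domain."""
    return REWARD_ROUTER[domain](pred, target)
```

The design point is that a single policy is optimized end to end while each domain keeps its own notion of correctness, so character-accurate transcription and executable code generation can improve under one training loop.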
5. 📊 Results and Evaluation: OCRVerse achieves an overall score of 89.23 on OmniDocBench v1.5 for text-centric tasks and competitive performance on vision-centric benchmarks (e.g., an 84.8% execution rate on ChartMimic and 76.3 on UniSVG), matching or exceeding much larger models despite having only 4B parameters.