2026-01-23 Papers

Paper 1

Toward Efficient Agents: Memory, Tool Learning, and Planning

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.14192

1. 📘 Topic and Domain: This paper surveys efficient agents in the domain of Large Language Model (LLM)-based autonomous systems, focusing on memory, tool learning, and planning components.
2. 💡 Previous Research and New Ideas: The paper builds on existing LLM agent research but identifies that while effectiveness has improved, efficiency (latency, token consumption, computational cost) has been overlooked; it proposes a comprehensive framework analyzing efficiency across memory, tool learning, and planning components.
3. ❓ Problem: The paper addresses the critical efficiency bottleneck in LLM-based agents, where recursive multi-step execution leads to exponentially growing resource consumption through token accumulation, context window saturation, and excessive computational costs.
4. 🛠️ Methods: The authors conduct a systematic literature review categorizing efficiency techniques into three core components: efficient memory (construction, management, access), efficient tool learning (selection, calling, reasoning), and efficient planning (single-agent and multi-agent strategies).
5. 📊 Results and Evaluation: The survey synthesizes efficiency metrics across benchmarks and methods, revealing common principles like context compression, reinforcement learning for minimizing tool invocation, and controlled search mechanisms, while identifying gaps in standardized efficiency evaluation frameworks.

Overview figure (Efficient Agents: Memory, Tool Learning & Planning):

• Core components: Memory (Construction, Management, Access); Tool Learning (Selection, Calling, Tool-Integrated Reasoning); Planning (Single-Agent Planning, Multi-Agent Collaboration, Efficiency Trade-offs)
• Efficiency metrics: Token Usage, Latency, API Cost, Memory Usage, Interaction Steps
• Benchmarks: MemBench, StoryBench (Memory); ToolBench, T-Eval (Tool Learning); TPS-Bench, CostBench (Planning); Cost-of-Pass metrics
• Key insights: Compression vs. Performance, Online vs. Offline Processing, Cost-aware Optimization, Multi-agent Coordination
• Future directions: Unified Evaluation, Latent Reasoning, MLLM Efficiency, Deployment-aware Design
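The survey's efficiency metrics (token usage, latency, API cost, interaction steps) can be made concrete with a small bookkeeping sketch. The class name, pricing, and the growing-context loop below are illustrative assumptions, not code from the paper:

```python
class EfficiencyTracker:
    """Accumulates per-step token, latency, and cost figures for an agent run."""

    def __init__(self, usd_per_1k_tokens):
        self.usd_per_1k_tokens = usd_per_1k_tokens
        self.total_tokens = 0
        self.total_latency = 0.0
        self.steps = 0

    def record(self, prompt_tokens, completion_tokens, latency_s):
        # Each agent step pays for the full (growing) context plus new output.
        self.total_tokens += prompt_tokens + completion_tokens
        self.total_latency += latency_s
        self.steps += 1

    @property
    def api_cost(self):
        return self.total_tokens / 1000 * self.usd_per_1k_tokens


# A 3-step run where each step's output is appended to the next prompt,
# illustrating the token-accumulation bottleneck the paper describes.
tracker = EfficiencyTracker(usd_per_1k_tokens=0.002)
context = 500
for step in range(3):
    tracker.record(prompt_tokens=context, completion_tokens=200, latency_s=1.2)
    context += 200  # context grows with every step

print(tracker.total_tokens)  # (500+200) + (700+200) + (900+200) = 2700
```

Even in this toy run, prompt tokens dominate after a few steps, which is why context compression recurs as a core efficiency technique across the surveyed methods.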
Q1. According to the paper, what is the fundamental difference in cost structure between pure LLMs and LLM-based agents?
• Agents only differ in having higher token generation costs due to longer contexts
• Agents incur additional overhead from tools, memory access, and retries beyond just token generation
• Agents are actually more cost-efficient because they can cache previous computations

Q2. What does the paper identify as a key efficiency strategy in multi-agent systems to avoid quadratic communication costs?
• Using larger language models that can process more information simultaneously
• Implementing structured topologies like chains or DAGs to achieve near-linear complexity
• Forcing all agents to communicate through a centralized database

Q3. How does the paper define 'latent memory' in the context of efficient agent memory systems?
• Memory that is stored in external databases and retrieved on-demand
• Textual summaries that are hidden from the user but visible to the agent
• Continuous signals like KV caches or hidden states that influence computation without being represented as tokens
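The topology point in Q2 can be made concrete by counting communication links; the function names are illustrative, not from the paper:

```python
def fully_connected_links(n):
    """Every agent talks to every other agent: n(n-1)/2 links, O(n^2)."""
    return n * (n - 1) // 2

def chain_links(n):
    """Agents arranged in a pipeline: n-1 links, O(n)."""
    return max(n - 1, 0)

# Link counts diverge quickly as the number of agents grows.
for n in (4, 16, 64):
    print(n, fully_connected_links(n), chain_links(n))
```

At 64 agents a fully connected topology already needs 2016 pairwise channels versus 63 for a chain, which is the quadratic-vs-near-linear gap the structured-topology strategy targets.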

Paper 2

GutenOCR: A Grounded Vision-Language Front-End for Documents

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.14490

1. 📘 Topic and Domain: The paper presents GutenOCR, a grounded vision-language model for optical character recognition (OCR) in documents, focusing on unified text reading, detection, and localization through a single checkpoint.
2. 💡 Previous Research and New Ideas: The paper builds on Qwen2.5-VL models and classical OCR pipelines, proposing a new approach that combines VLM flexibility with traditional OCR's explicit grounding capabilities through prompt-based interfaces for reading, detection, and conditional localization.
3. ❓ Problem: The paper addresses the lack of grounded OCR front-ends that provide both high-quality text extraction and fine-grained control over how documents are read, including explicit links between tokens and pixels for production systems.
4. 🛠️ Methods: The authors fine-tune Qwen2.5-VL-3B/7B models using a curriculum-based training approach on business documents, scientific articles, and synthetic data, exposing multiple OCR tasks through unified prompt-based schemas without modifying the architecture.
5. 📊 Results and Evaluation: GutenOCR-7B more than doubles the composite grounded OCR score of its backbone (0.40 → 0.82) on held-out pages and substantially improves region- and line-level OCR on the Fox benchmark, but shows trade-offs in formula recognition and color-guided tasks.

Overview figure (GutenOCR: Grounded Vision-Language OCR Pipeline):

• Data sources: OCR-IDL (26M pages), TabMe++ (122K pages), PubMed-OCR (1.5M pages); synthetic data: grounded LaTeX (3M), SynthDoG (1.2M)
• Base models: Qwen2.5-VL-3B/7B
• 4-stage training curriculum: Stage 1 core (<2k tokens); Stage 2 real spec. (2k–8k); Stages 3a/3b PMD (8k–16k)
• Task families: full-page reading (text, text2d, lines, paragraphs); detection (lines, paragraphs, math regions); conditional detection (image + query → BOX(q)); localized reading (image + box → text)
• GutenOCR models: unified grounded OCR front-end with a prompt-based interface, producing text + bounding boxes
• Evaluation protocol: text metrics CER, WER; detection F1@0.5, recall; end-to-end CERe2e, mCER; benchmarks: in-domain, Fox, OmniDocBench v1.5
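Among GutenOCR's text metrics, character error rate (CER) is the standard edit distance between hypothesis and reference, normalized by reference length. A minimal reference implementation (not the paper's actual evaluation harness) looks like:

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / len(reference)

print(cer("GutenOCR", "GutemOCR"))  # one substitution over 8 chars = 0.125
```

WER is computed the same way over word tokens instead of characters. Note that CER is order-sensitive, which is why a model can score high Page F1 (right content) yet poor Page CER (different reading order), as in Q3 below.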
Q1. What is the primary innovation of GutenOCR compared to existing OCR approaches?
• It achieves 100% accuracy on all document types by using a larger model size
• It provides a unified checkpoint that supports reading, detection, and grounding through prompt-based interfaces
• It completely replaces traditional OCR with pure vision transformers without any text output

Q2. What critical limitation does GutenOCR exhibit according to the evaluation results?
• It cannot process documents longer than one page
• It performs poorly on color-guided OCR tasks due to catastrophic forgetting
• It only works with English text and fails on all other languages

Q3. Why does GutenOCR achieve high Page F1 but poor Page CER on the Fox benchmark?
• The model hallucinates extra text that doesn't exist in the original document
• It reads the correct content but follows a layout-driven order that differs from Fox's canonical linearization
• The evaluation metrics are broken and produce inconsistent results

Paper 3

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.14251

1. 📘 Topic and Domain: The paper presents LightOnOCR-2-1B, a compact end-to-end multilingual vision-language model for state-of-the-art optical character recognition (OCR) in document understanding.
2. 💡 Previous Research and New Ideas: The paper builds on traditional OCR pipelines and recent vision-language models like Nougat, olmOCR, and Qwen-VL, proposing a unified 1B-parameter model that eliminates brittle multi-stage pipelines and adds image bounding box localization capabilities.
3. ❓ Problem: The paper aims to solve the complexity and brittleness of traditional multi-stage OCR pipelines that require coordinated changes across components when adapting to new document distributions.
4. 🛠️ Methods: The authors use supervised pretraining on 43M document pages with a stronger teacher model (Qwen3-VL-235B), higher-resolution training (up to 1540 px on the longest edge), reinforcement learning with verifiable rewards (RLVR), and weight-space techniques like checkpoint averaging and task-arithmetic merging.
5. 📊 Results and Evaluation: LightOnOCR-2-1B achieves state-of-the-art performance on OlmOCR-Bench (83.2% overall score) while being 9× smaller than prior best models, with 5.7× higher inference throughput and successful image localization (F1@0.5: 0.78-0.83) on their new LightOnOCR-bbox-bench.

Overview figure (LightOnOCR-2-1B training workflow):

• Data collection: 43M pages (2.5× scale) from PDFA, scans, arXiv
• Teacher distillation: Qwen3-VL-235B produces Markdown + LaTeX targets
• nvpdftex pipeline: arXiv TeX sources yield pixel-aligned annotations
• Normalization: clean LaTeX, remove artifacts, deduplication, filtering
• 1B architecture: vision encoder (Mistral) + MLP projector + LM decoder (Qwen3); native resolution, 2×2 spatial merging, max 1540px longest edge
• Supervised pretraining: next-token prediction, AdamW, lr=1e-4, batch=384; augmentations: erosion, rotation, distortion; checkpoint averaging (last 5) → LightOnOCR-2-1B-base
• OCR RLVR: unit tests, repetition penalty, KaTeX validation; GRPO, lr=4e-5, β=0.01
• Bbox RLVR: resume with coordinate supervision, IoU-based rewards
• Outputs: LightOnOCR-2-1B (best OCR), LightOnOCR-2-1B-bbox (with localization), LightOnOCR-2-1B-soup (task-arithmetic merged variants, OCR–bbox trade-off)
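The checkpoint-averaging and task-arithmetic-merging steps in the workflow above operate purely in weight space. A minimal sketch follows; the toy one-tensor state dicts and the single scaling coefficient are assumptions for illustration, not the paper's actual merge recipe:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform per-tensor average of several checkpoints (e.g. the last 5)."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

def task_arithmetic_merge(base, finetuned, alpha):
    """Add a scaled task vector (finetuned - base) back onto the base weights."""
    return {name: base[name] + alpha * (finetuned[name] - base[name])
            for name in base}

# Toy single-tensor "models" standing in for full state dicts.
ckpts = [{"w": np.array([1.0, 2.0])},
         {"w": np.array([3.0, 4.0])}]
base = average_checkpoints(ckpts)        # w = [2.0, 3.0]
bbox = {"w": np.array([4.0, 3.0])}       # hypothetical bbox-finetuned weights
soup = task_arithmetic_merge(base, bbox, alpha=0.5)
print(soup["w"])  # [3.0, 3.0]
```

Tuning alpha trades off how much of the bbox specialization is blended in against drift from the base OCR weights, which matches the OCR–bbox trade-off the "soup" variant is described as navigating.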
Q1. What unique training strategy does LightOnOCR-2 employ to add image localization capabilities without degrading OCR quality?
• Training separate models for OCR and localization, then ensembling them at inference
• Introducing coordinate supervision during pretraining via a resume strategy, then refining with RLVR using IoU-based rewards
• Fine-tuning on synthetic data with artificially generated bounding boxes

Q2. How does LightOnOCR-2's performance compare to larger models on OlmOCR-Bench?
• It achieves 83.2% overall score while being 9× smaller than prior best-performing models
• It performs 20% worse but compensates with 50× faster inference speed
• It matches performance only on English documents but fails on multilingual content

Q3. What innovative data curation technique does the paper introduce for obtaining high-quality supervision from scientific documents?
• Using GPT-4 to manually annotate each arXiv paper with ground truth labels
• Implementing an nvpdftex-based pipeline that hooks into the pdfLaTeX engine to produce pixel-aligned annotations
• Converting all scientific PDFs to HTML first, then extracting text using web scraping tools
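The IoU-based rewards used in LightOnOCR-2's bbox RLVR stage can be sketched as follows; `bbox_reward` and its 0.5 threshold are hypothetical choices for illustration, not the paper's exact reward function:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def bbox_reward(pred, target, threshold=0.5):
    """Verifiable reward for a GRPO-style RLVR step: IoU, zeroed below a threshold."""
    score = iou(pred, target)
    return score if score >= threshold else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # overlap 50, union 150 -> 1/3
print(bbox_reward((0, 0, 10, 10), (1, 1, 11, 11)))  # ~0.68, above threshold
```

Because IoU is computable directly from the model's emitted coordinates against ground truth, it fits the "verifiable reward" framing the paper uses for its other RLVR signals (unit tests, KaTeX validation).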