2026-01-23 Papers

Paper 1

Toward Efficient Agents: Memory, Tool Learning, and Planning

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.14192

1. 📘 Topic and Domain: This paper surveys efficient agents in the domain of Large Language Model (LLM)-based autonomous systems, focusing on memory, tool learning, and planning components.
2. 💡 Previous Research and New Ideas: The paper builds on existing LLM agent research but identifies that while effectiveness has improved, efficiency (latency, token consumption, computational cost) has been overlooked; it proposes a comprehensive framework analyzing efficiency across memory, tool learning, and planning components.
3. ❓ Problem: The paper addresses the critical efficiency bottleneck in LLM-based agents, where recursive multi-step execution leads to exponentially growing resource consumption through token accumulation, context window saturation, and excessive computational costs.
4. 🛠️ Methods: The authors conduct a systematic literature review categorizing efficiency techniques into three core components: efficient memory (construction, management, access), efficient tool learning (selection, calling, reasoning), and efficient planning (single-agent and multi-agent strategies).
5. 📊 Results and Evaluation: The survey synthesizes efficiency metrics across benchmarks and methods, revealing common principles like context compression, reinforcement learning for minimizing tool invocation, and controlled search mechanisms, while identifying gaps in standardized efficiency evaluation frameworks.

Overview figure (Efficient Agents: Memory, Tool Learning & Planning):

• Core components: Memory (Construction, Management, Access); Tool Learning (Selection, Calling, Tool-Integrated Reasoning); Planning (Single-Agent Planning, Multi-Agent Collaboration, Efficiency Trade-offs)
• Efficiency metrics: Token Usage, Latency, API Cost, Memory Usage, Interaction Steps
• Benchmarks: MemBench, StoryBench (Memory); ToolBench, T-Eval (Tool Learning); TPS-Bench, CostBench (Planning); Cost-of-Pass metrics
• Key insights: Compression vs. Performance, Online vs. Offline Processing, Cost-aware Optimization, Multi-agent Coordination
• Future directions: Unified Evaluation, Latent Reasoning, MLLM Efficiency, Deployment-aware Design
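The survey's efficiency metrics (token usage, latency, API cost, interaction steps) can be made concrete with a small bookkeeping sketch. The class name, pricing, and the growing-context loop below are illustrative assumptions, not code from the paper:

```python
class EfficiencyTracker:
    """Accumulates per-step token, latency, and cost figures for an agent run."""

    def __init__(self, usd_per_1k_tokens):
        self.usd_per_1k_tokens = usd_per_1k_tokens
        self.total_tokens = 0
        self.total_latency = 0.0
        self.steps = 0

    def record(self, prompt_tokens, completion_tokens, latency_s):
        # Each agent step pays for the full (growing) context plus new output.
        self.total_tokens += prompt_tokens + completion_tokens
        self.total_latency += latency_s
        self.steps += 1

    @property
    def api_cost(self):
        return self.total_tokens / 1000 * self.usd_per_1k_tokens


# A 3-step run where each step's output is appended to the next prompt,
# illustrating the token-accumulation bottleneck the paper describes.
tracker = EfficiencyTracker(usd_per_1k_tokens=0.002)
context = 500
for step in range(3):
    tracker.record(prompt_tokens=context, completion_tokens=200, latency_s=1.2)
    context += 200  # context grows with every step

print(tracker.total_tokens)  # (500+200) + (700+200) + (900+200) = 2700
```

Even in this toy run, prompt tokens dominate after a few steps, which is why context compression recurs as a core efficiency technique across the surveyed methods.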
Q1. According to the paper, what is the fundamental difference in cost structure between pure LLMs and LLM-based agents?
• Agents only differ in having higher token generation costs due to longer contexts
• Agents incur additional overhead from tools, memory access, and retries beyond just token generation
• Agents are actually more cost-efficient because they can cache previous computations

Q2. What does the paper identify as a key efficiency strategy in multi-agent systems to avoid quadratic communication costs?
• Using larger language models that can process more information simultaneously
• Implementing structured topologies like chains or DAGs to achieve near-linear complexity
• Forcing all agents to communicate through a centralized database

Q3. How does the paper define 'latent memory' in the context of efficient agent memory systems?
• Memory that is stored in external databases and retrieved on-demand
• Textual summaries that are hidden from the user but visible to the agent
• Continuous signals like KV caches or hidden states that influence computation without being represented as tokens
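The topology point in Q2 can be made concrete by counting communication links; the function names are illustrative, not from the paper:

```python
def fully_connected_links(n):
    """Every agent talks to every other agent: n(n-1)/2 links, O(n^2)."""
    return n * (n - 1) // 2

def chain_links(n):
    """Agents arranged in a pipeline: n-1 links, O(n)."""
    return max(n - 1, 0)

# Link counts diverge quickly as the number of agents grows.
for n in (4, 16, 64):
    print(n, fully_connected_links(n), chain_links(n))
```

At 64 agents a fully connected topology already needs 2016 pairwise channels versus 63 for a chain, which is the quadratic-vs-near-linear gap the structured-topology strategy targets.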

Paper 2

GutenOCR: A Grounded Vision-Language Front-End for Documents

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.14490

1. 📘 Topic and Domain: The paper presents GutenOCR, a grounded vision-language model for optical character recognition (OCR) in documents, focusing on unified text reading, detection, and localization through a single checkpoint.
2. 💡 Previous Research and New Ideas: The paper builds on Qwen2.5-VL models and classical OCR pipelines, proposing a new approach that combines VLM flexibility with traditional OCR's explicit grounding capabilities through prompt-based interfaces for reading, detection, and conditional localization.
3. ❓ Problem: The paper addresses the lack of grounded OCR front-ends that provide both high-quality text extraction and fine-grained control over how documents are read, including explicit links between tokens and pixels for production systems.
4. 🛠️ Methods: The authors fine-tune Qwen2.5-VL-3B/7B models using a curriculum-based training approach on business documents, scientific articles, and synthetic data, exposing multiple OCR tasks through unified prompt-based schemas without modifying the architecture.
5. 📊 Results and Evaluation: GutenOCR-7B more than doubles the composite grounded OCR score of its backbone (0.40 → 0.82) on held-out pages and substantially improves region- and line-level OCR on the Fox benchmark, but shows trade-offs in formula recognition and color-guided tasks.

Overview figure (GutenOCR: Grounded Vision-Language OCR Pipeline):

• Data sources: OCR-IDL (26M pages), TabMe++ (122K pages), PubMed-OCR (1.5M pages); synthetic data: grounded LaTeX (3M), SynthDoG (1.2M)
• Base models: Qwen2.5-VL-3B/7B
• 4-stage training curriculum: Stage 1 core (<2k tokens); Stage 2 real spec. (2k–8k); Stages 3a/3b PMD (8k–16k)
• Task families: full-page reading (text, text2d, lines, paragraphs); detection (lines, paragraphs, math regions); conditional detection (image + query → BOX(q)); localized reading (image + box → text)
• GutenOCR models: unified grounded OCR front-end with a prompt-based interface, producing text + bounding boxes
• Evaluation protocol: text metrics CER, WER; detection F1@0.5, recall; end-to-end CERe2e, mCER; benchmarks: in-domain, Fox, OmniDocBench v1.5
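Among GutenOCR's text metrics, character error rate (CER) is the standard edit distance between hypothesis and reference, normalized by reference length. A minimal reference implementation (not the paper's actual evaluation harness) looks like:

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / len(reference)

print(cer("GutenOCR", "GutemOCR"))  # one substitution over 8 chars = 0.125
```

WER is computed the same way over word tokens instead of characters. Note that CER is order-sensitive, which is why a model can score high Page F1 (right content) yet poor Page CER (different reading order), as in Q3 below.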
Q1. What is the primary innovation of GutenOCR compared to existing OCR approaches?
• It achieves 100% accuracy on all document types by using a larger model size
• It provides a unified checkpoint that supports reading, detection, and grounding through prompt-based interfaces
• It completely replaces traditional OCR with pure vision transformers without any text output

Q2. What critical limitation does GutenOCR exhibit according to the evaluation results?
• It cannot process documents longer than one page
• It performs poorly on color-guided OCR tasks due to catastrophic forgetting
• It only works with English text and fails on all other languages

Q3. Why does GutenOCR achieve high Page F1 but poor Page CER on the Fox benchmark?
• The model hallucinates extra text that doesn't exist in the original document
• It reads the correct content but follows a layout-driven order that differs from Fox's canonical linearization
• The evaluation metrics are broken and produce inconsistent results

Paper 3

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.14251

1. 📘 Topic and Domain: The paper presents LightOnOCR-2-1B, a compact end-to-end multilingual vision-language model for state-of-the-art optical character recognition (OCR) in document understanding.
2. 💡 Previous Research and New Ideas: The paper builds on traditional OCR pipelines and recent vision-language models like Nougat, olmOCR, and Qwen-VL, proposing a unified 1B-parameter model that eliminates brittle multi-stage pipelines and adds image bounding box localization capabilities.
3. ❓ Problem: The paper aims to solve the complexity and brittleness of traditional multi-stage OCR pipelines that require coordinated changes across components when adapting to new document distributions.
4. 🛠️ Methods: The authors use supervised pretraining on 43M document pages with a stronger teacher model (Qwen3-VL-235B), higher-resolution training (up to 1540 px on the longest edge), reinforcement learning with verifiable rewards (RLVR), and weight-space techniques like checkpoint averaging and task-arithmetic merging.
5. 📊 Results and Evaluation: LightOnOCR-2-1B achieves state-of-the-art performance on OlmOCR-Bench (83.2% overall score) while being 9× smaller than prior best models, with 5.7× higher inference throughput and successful image localization (F1@0.5: 0.78-0.83) on their new LightOnOCR-bbox-bench.

Overview figure (LightOnOCR-2-1B training workflow):

• Data collection: 43M pages (2.5× scale) from PDFA, scans, arXiv
• Teacher distillation: Qwen3-VL-235B produces Markdown + LaTeX targets
• nvpdftex pipeline: arXiv TeX sources yield pixel-aligned annotations
• Normalization: clean LaTeX, remove artifacts, deduplication, filtering
• 1B architecture: vision encoder (Mistral) + MLP projector + LM decoder (Qwen3); native resolution, 2×2 spatial merging, max 1540px longest edge
• Supervised pretraining: next-token prediction, AdamW, lr=1e-4, batch=384; augmentations: erosion, rotation, distortion; checkpoint averaging (last 5) → LightOnOCR-2-1B-base
• OCR RLVR: unit tests, repetition penalty, KaTeX validation; GRPO, lr=4e-5, β=0.01
• Bbox RLVR: resume with coordinate supervision, IoU-based rewards
• Outputs: LightOnOCR-2-1B (best OCR), LightOnOCR-2-1B-bbox (with localization), LightOnOCR-2-1B-soup (task-arithmetic merged variants, OCR–bbox trade-off)
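The checkpoint-averaging and task-arithmetic-merging steps in the workflow above operate purely in weight space. A minimal sketch follows; the toy one-tensor state dicts and the single scaling coefficient are assumptions for illustration, not the paper's actual merge recipe:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform per-tensor average of several checkpoints (e.g. the last 5)."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

def task_arithmetic_merge(base, finetuned, alpha):
    """Add a scaled task vector (finetuned - base) back onto the base weights."""
    return {name: base[name] + alpha * (finetuned[name] - base[name])
            for name in base}

# Toy single-tensor "models" standing in for full state dicts.
ckpts = [{"w": np.array([1.0, 2.0])},
         {"w": np.array([3.0, 4.0])}]
base = average_checkpoints(ckpts)        # w = [2.0, 3.0]
bbox = {"w": np.array([4.0, 3.0])}       # hypothetical bbox-finetuned weights
soup = task_arithmetic_merge(base, bbox, alpha=0.5)
print(soup["w"])  # [3.0, 3.0]
```

Tuning alpha trades off how much of the bbox specialization is blended in against drift from the base OCR weights, which matches the OCR–bbox trade-off the "soup" variant is described as navigating.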
Q1. What unique training strategy does LightOnOCR-2 employ to add image localization capabilities without degrading OCR quality?
• Training separate models for OCR and localization, then ensembling them at inference
• Introducing coordinate supervision during pretraining via a resume strategy, then refining with RLVR using IoU-based rewards
• Fine-tuning on synthetic data with artificially generated bounding boxes

Q2. How does LightOnOCR-2's performance compare to larger models on OlmOCR-Bench?
• It achieves 83.2% overall score while being 9× smaller than prior best-performing models
• It performs 20% worse but compensates with 50× faster inference speed
• It matches performance only on English documents but fails on multilingual content

Q3. What innovative data curation technique does the paper introduce for obtaining high-quality supervision from scientific documents?
• Using GPT-4 to manually annotate each arXiv paper with ground truth labels
• Implementing an nvpdftex-based pipeline that hooks into the pdfLaTeX engine to produce pixel-aligned annotations
• Converting all scientific PDFs to HTML first, then extracting text using web scraping tools
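The IoU-based rewards used in LightOnOCR-2's bbox RLVR stage can be sketched as follows; `bbox_reward` and its 0.5 threshold are hypothetical choices for illustration, not the paper's exact reward function:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def bbox_reward(pred, target, threshold=0.5):
    """Verifiable reward for a GRPO-style RLVR step: IoU, zeroed below a threshold."""
    score = iou(pred, target)
    return score if score >= threshold else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # overlap 50, union 150 -> 1/3
print(bbox_reward((0, 0, 10, 10), (1, 1, 11, 11)))  # ~0.68, above threshold
```

Because IoU is computable directly from the model's emitted coordinates against ground truth, it fits the "verifiable reward" framing the paper uses for its other RLVR signals (unit tests, KaTeX validation).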