2026-01-30 Papers

1/2

Paper 1

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Published: 2026-01-29

Link: http://arxiv.org/pdf/2601.22153

1. 📘 Topic and Domain: The paper presents DynamicVLA, a Vision-Language-Action model for dynamic object manipulation in robotics.

2. 💡 Previous Research and New Ideas: Building on existing VLA models that perform well in static manipulation, the paper proposes a compact 0.4B-parameter architecture with Continuous Inference and Latent-aware Action Streaming to address temporal misalignment issues.

3. ❓ Problem: The paper solves the perception-execution gap in dynamic object manipulation where inference delays cause temporal misalignment between observation and action execution, leading to failure in manipulating moving objects.

4. 🛠️ Methods: The authors use a lightweight VLA with convolutional vision encoder (FastViT), Continuous Inference for overlapping reasoning and execution, and Latent-aware Action Streaming to enforce temporally aligned action execution, plus an automated data collection pipeline.

5. 📊 Results and Evaluation: DynamicVLA achieves 47.06% average success rate on the DOM benchmark across nine evaluation dimensions, significantly outperforming baselines (e.g., +188.1% improvement in closed-loop reactivity), with faster task completion times and better generalization to unseen objects and motion patterns.

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

1/2

Paper 2

Scaling Embeddings Outperforms Scaling Experts in Language Models

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.21204

1. 📘 Topic and Domain: The paper investigates scaling embeddings versus scaling experts in large language models, specifically comparing N-gram Embedding scaling against Mixture-of-Experts (MoE) scaling strategies.

2. 💡 Previous Research and New Ideas: The paper builds on existing MoE architectures and N-gram Embedding techniques, proposing that scaling embedding parameters can achieve superior efficiency compared to increasing expert numbers in high-sparsity regimes.

3. ❓ Problem: The paper addresses diminishing returns and system-level bottlenecks in MoE architectures as models scale, seeking alternative dimensions for efficient parameter scaling.

4. 🛠️ Methods: The authors conduct comparative scaling experiments across different parameter budgets, analyze architectural factors affecting embedding scaling efficacy, and implement system optimizations including N-gram Cache and speculative decoding.

5. 📊 Results and Evaluation: LongCat-Flash-Lite (68.5B parameters, 2.9B-4.5B activated) outperforms parameter-equivalent MoE baselines and shows competitive performance against existing models, particularly excelling in agentic and coding tasks across multiple benchmarks.

Scaling Embeddings Outperforms Scaling Experts in Language Models

1/2

Paper 3

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Published: 2026-01-29

Link: http://arxiv.org/pdf/2601.21639

1. 📘 Topic and Domain: The paper focuses on holistic Optical Character Recognition (OCR) that unifies both text-centric (documents, formulas, tables) and vision-centric (charts, web pages, scientific plots) recognition in the computer vision and natural language processing domain.

2. 💡 Previous Research and New Ideas: The paper builds on existing text-centric OCR methods (pipeline-based and VLM-based approaches) and vision-centric parsing techniques, proposing OCRVerse as the first end-to-end holistic OCR method that bridges character-level recognition with code-level representation through a unified framework.

3. ❓ Problem: The paper addresses the limitation that existing OCR methods focus primarily on text extraction from documents while neglecting visually information-dense sources requiring code-level representations, creating a gap in handling diverse real-world multimodal content.

4. 🛠️ Methods: The authors use comprehensive data engineering covering 15 data types and a two-stage SFT-RL training methodology, where SFT establishes cross-domain knowledge through mixed data training and RL applies personalized reward strategies for domain-specific optimization.

5. 📊 Results and Evaluation: OCRVerse achieves 89.23 overall score on OmniDocBench v1.5 for text-centric tasks and competitive performance on vision-centric benchmarks (e.g., 84.8% execution rate on ChartMimic, 76.3 on UniSVG), matching or exceeding much larger models despite having only 4B parameters.