2026-01-30 Papers


Paper 1

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Published: 2026-01-29

Link: http://arxiv.org/pdf/2601.22153

1. 📘 Topic and Domain: The paper presents DynamicVLA, a Vision-Language-Action model for dynamic object manipulation in robotics.
2. 💡 Previous Research and New Ideas: Building on existing VLA models that perform well in static manipulation, the paper proposes a compact 0.4B-parameter architecture with Continuous Inference and Latent-aware Action Streaming to address temporal misalignment issues.
3. ❓ Problem: The paper addresses the perception-execution gap in dynamic object manipulation, where inference delay causes temporal misalignment between observation and action execution and leads to failures when manipulating moving objects.
4. 🛠️ Methods: The authors use a lightweight VLA with convolutional vision encoder (FastViT), Continuous Inference for overlapping reasoning and execution, and Latent-aware Action Streaming to enforce temporally aligned action execution, plus an automated data collection pipeline.
5. 📊 Results and Evaluation: DynamicVLA achieves 47.06% average success rate on the DOM benchmark across nine evaluation dimensions, significantly outperforming baselines (e.g., +188.1% improvement in closed-loop reactivity), with faster task completion times and better generalization to unseen objects and motion patterns.
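The Continuous Inference and Latent-aware Action Streaming ideas above can be illustrated with a toy control loop. This is a minimal sketch, not the paper's implementation: the chunk size n = 20 comes from the paper, while the latency m = 5 steps, the `policy`/`get_obs`/`execute` callables, and the single-threaded simulation of overlapping inference are all illustrative assumptions.

```python
from collections import deque

CHUNK = 20    # actions per predicted chunk (n = 20, from the paper)
LATENCY = 5   # control steps elapsed during one inference pass (m, assumed)

def stream_actions(policy, get_obs, execute, steps):
    """Toy pipelined control loop: inference for the next chunk overlaps
    with execution of the current one (Continuous Inference), and actions
    predicted for timesteps that have already passed are discarded
    (Latent-aware Action Streaming). Requires CHUNK > LATENCY so the
    action buffer never runs dry."""
    assert CHUNK > LATENCY
    buffer = deque()
    pending = None  # (start_time, chunk) of the in-flight inference
    for t in range(steps):
        if pending is None:  # bootstrap: launch the very first inference
            pending = (t, policy(get_obs(t)))
        start, chunk = pending
        if t - start >= LATENCY:  # in-flight inference has "finished"
            # Drop the first (t - start) actions: they target timesteps
            # that already elapsed while the model was thinking.
            buffer = deque(chunk[t - start:])
            pending = (t, policy(get_obs(t)))  # immediately start the next one
        if buffer:
            execute(t, buffer.popleft())
```

With a dummy policy whose chunk for an observation at time t is `[t, t+1, ..., t+19]` (each entry naming the timestep it is meant for), every executed action matches its timestep once the first inference lands, which is the temporal-alignment property the streaming mechanism enforces.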

[Workflow figure] DynamicVLA model and benchmark overview:
• Architecture (0.4B parameters): FastViT convolutional encoder (384×384 input → 36 tokens, spatial compression for fast inference); first 16 layers of SmolLM2-360M for multimodal fusion with K-V caching; 16-layer flow-matching action expert producing action chunks of n = 20.
• Inputs → output: observations Ot = {ot−2, ot}, language instruction Lt, and proprioceptive state Pt → action chunk At.
• Continuous Inference: overlapping, pipelined inference cycles with no inter-chunk waiting, maintaining an unbroken action stream (requires n > m).
• Latent-aware Action Streaming: enforces temporal alignment by discarding outdated actions and prioritizing recent predictions, resolving the perception-execution gap and action conflicts.
• Performance: 88 Hz inference speed, 1.8 GB GPU memory, 47.06% success rate, real-time control.
• DOM benchmark data: 200K synthetic episodes (Isaac Sim) and 2K real-world episodes collected without teleoperation; 2.8K scenes, 206 objects; two embodiments (Franka and PiPER). Collection is driven by a state-machine controller (S1 approach object → S2 grasp & lift → S3 approach target & place → S4 reset) with real-time 6D pose/velocity tracking and ~0.23 s predictive motion compensation.
• Evaluation dimensions: interaction (closed-loop reactivity, dynamic adaptation, long-horizon sequencing), perception (visual understanding, spatial reasoning, motion perception), and generalization (visual, motion, disturbance robustness).
Q1
1. What is the primary failure mode that DynamicVLA addresses in dynamic object manipulation?
Perceptual ambiguity due to poor vision encoder quality
Temporal misalignment between observation and action execution
Insufficient training data for moving objects
Q2
2. How does DynamicVLA's real-world data collection pipeline overcome the limitations of teleoperation for dynamic manipulation?
It uses a 'real-world simulator' with 3D object tracking to drive autonomous state-machine controllers
It employs expert human operators with specialized haptic feedback devices
It generates synthetic data using video diffusion models trained on real footage
Q3
3. What architectural choice enables DynamicVLA to achieve faster inference compared to transformer-based VLA models?
Using a larger 7B parameter language model for better compression
Implementing a convolutional vision encoder (FastViT) instead of transformer-based encoders
Removing the vision encoder entirely and relying only on language descriptions

Paper 2

Scaling Embeddings Outperforms Scaling Experts in Language Models

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.21204

1. 📘 Topic and Domain: The paper investigates scaling embeddings versus scaling experts in large language models, specifically comparing N-gram Embedding scaling against Mixture-of-Experts (MoE) scaling strategies.
2. 💡 Previous Research and New Ideas: The paper builds on existing MoE architectures and N-gram Embedding techniques, proposing that scaling embedding parameters is more efficient than increasing the number of experts in high-sparsity regimes.
3. ❓ Problem: The paper addresses diminishing returns and system-level bottlenecks in MoE architectures as models scale, seeking alternative dimensions for efficient parameter scaling.
4. 🛠️ Methods: The authors conduct comparative scaling experiments across different parameter budgets, analyze architectural factors affecting embedding scaling efficacy, and implement system optimizations including N-gram Cache and speculative decoding.
5. 📊 Results and Evaluation: LongCat-Flash-Lite (68.5B parameters, 2.9B-4.5B activated) outperforms parameter-equivalent MoE baselines and shows competitive performance against existing models, particularly excelling in agentic and coding tasks across multiple benchmarks.
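The hashed N-gram Embedding mechanism, and the collision pathology the paper warns about, can be sketched as follows. This is a toy under assumed values: the base vocabulary size, table size, embedding dimension, and the specific hash (packing a 2-gram into one integer, then taking a modulo) are illustrative, not the paper's exact design; only K ≥ 2 sub-tables and "avoid table sizes at integer multiples of the vocabulary" come from the summary above.

```python
import numpy as np

VOCAB = 32000        # base vocabulary size (assumed)
TABLE_SIZE = 99991   # prime sub-table size, deliberately NOT a multiple of VOCAB
DIM = 16             # embedding dimension (illustrative)
K = 2                # hash sub-tables; the paper recommends K >= 2

rng = np.random.default_rng(0)
tables = [rng.normal(0.0, 0.02, size=(TABLE_SIZE, DIM)) for _ in range(K)]

def bigram_hash(t1, t2, table_size, vocab=VOCAB):
    # Pack the 2-gram into a single integer (unique below vocab**2),
    # then wrap it into the table with a modulo.
    return (t1 * vocab + t2) % table_size

def ngram_embed(t1, t2):
    """Average the hashed 2-gram rows from K sub-tables; a different hash
    per table makes a simultaneous collision in all tables unlikely."""
    vec = np.zeros(DIM)
    for k, table in enumerate(tables):
        # offset each sub-table's hash so the tables disagree on collisions
        h = bigram_hash(t1 + k * 7919, t2, TABLE_SIZE)
        vec += table[h]
    return vec / K

# The collision spike: if table_size = c * VOCAB, the modulo collapses t1
# to (t1 mod c), so any two first tokens that agree mod c collide for
# EVERY second token. A prime table size avoids this structure.
assert bigram_hash(0, 7, 4 * VOCAB) == bigram_hash(4, 7, 4 * VOCAB)
assert bigram_hash(0, 7, TABLE_SIZE) != bigram_hash(4, 7, TABLE_SIZE)
```

The two assertions at the end are exactly the paper's reported phenomenon in miniature: with a table size of 4 × VOCAB, the bigrams (0, 7) and (4, 7) hash identically, while the prime-sized table separates them.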

[Method workflow figure] Scaling embeddings vs. scaling experts:
• Problem: diminishing returns and system bottlenecks as MoE models scale.
• Approach: N-gram Embedding (base embedding plus hashed N-gram tables), comparative embedding-vs-expert scaling laws, and system optimizations (N-gram Cache, kernel fusion).
• Design principles: introduce N-gram Embedding once the expert count exceeds its "sweet spot"; allocate at most 50% of parameters to N-gram Embedding; avoid table sizes at integer multiples of the base vocabulary size (hash-collision spikes); gains are larger in wider models and diminish in deeper ones; N-gram order 3–5 and K ≥ 2 sub-tables are optimal; amplify the embedding signal (scaling factor or LayerNorm) to preserve its contribution.
• LongCat-Flash-Lite: 68.5B total parameters (2.9B–4.5B activated), 31.4B N-gram parameters (46% of total), 256 FFN experts plus 128 zero-experts.
• Key results: outperforms the MoE baseline; strongest on agentic and coding tasks (SWE-Bench: 54.4% accuracy); competitive with larger models; efficient inference with the above optimizations and synergy with speculative decoding.
Q1
1. What is the key architectural insight regarding when to integrate N-gram Embedding according to the paper?
N-gram Embedding should be applied immediately when training begins to maximize parameter efficiency
N-gram Embedding should be introduced when the number of experts exceeds its 'sweet spot'
N-gram Embedding works best when applied to models with fewer than 10 experts
Q2
2. What unexpected phenomenon did the authors discover about 2-gram hashing collision rates?
Collision rates spike when vocabulary size approaches integer multiples of the base vocabulary size
Prime number vocabulary sizes eliminate all hash collisions
Larger vocabulary sizes always result in exponentially increasing collision rates
Q3
3. How does model depth affect the performance advantage of N-gram Embedding according to the experiments?
Deeper models amplify N-gram Embedding's advantages due to better feature extraction
Model depth has no significant impact on N-gram Embedding effectiveness
Increasing model depth diminishes the relative advantage of N-gram Embedding

Paper 3

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Published: 2026-01-29

Link: http://arxiv.org/pdf/2601.21639

1. 📘 Topic and Domain: The paper focuses on holistic Optical Character Recognition (OCR) that unifies both text-centric (documents, formulas, tables) and vision-centric (charts, web pages, scientific plots) recognition in the computer vision and natural language processing domain.
2. 💡 Previous Research and New Ideas: The paper builds on existing text-centric OCR methods (pipeline-based and VLM-based approaches) and vision-centric parsing techniques, proposing OCRVerse as the first end-to-end holistic OCR method that bridges character-level recognition with code-level representation through a unified framework.
3. ❓ Problem: The paper addresses the limitation that existing OCR methods focus primarily on text extraction from documents while neglecting visually information-dense sources requiring code-level representations, creating a gap in handling diverse real-world multimodal content.
4. 🛠️ Methods: The authors use comprehensive data engineering covering 15 data types and a two-stage SFT-RL training methodology, where SFT establishes cross-domain knowledge through mixed data training and RL applies personalized reward strategies for domain-specific optimization.
5. 📊 Results and Evaluation: OCRVerse achieves 89.23 overall score on OmniDocBench v1.5 for text-centric tasks and competitive performance on vision-centric benchmarks (e.g., 84.8% execution rate on ChartMimic, 76.3 on UniSVG), matching or exceeding much larger models despite having only 4B parameters.
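The "personalized reward strategies" in the RL stage can be sketched as a reward router. This is a deliberately crude stand-in, not the paper's reward functions: the `sample` dict schema is invented, the text-centric rule-based reward is approximated by string similarity, and the vision-centric visual-fidelity reward (which would render the predicted code and compare images) is approximated by simply checking whether the code executes.

```python
import difflib

def reward(sample):
    """Route each rollout to a domain-appropriate scorer, mirroring the
    two-branch design above: rule-based rewards for text-centric outputs,
    a fidelity-style reward for vision-centric (code) outputs."""
    if sample["domain"] == "text":
        # rule-based proxy: similarity of predicted text to the reference
        return difflib.SequenceMatcher(
            None, sample["pred"], sample["ref"]).ratio()
    # vision-centric proxy: reward only code that actually runs
    try:
        exec(compile(sample["pred"], "<pred>", "exec"), {})
        return 1.0
    except Exception:
        return 0.0
```

Routing rewards by domain rather than using one shared objective is the mechanism the paper credits with resolving conflicts between character-level and code-level supervision.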

[Method workflow figure] OCRVerse pipeline:
• Data engineering: text-centric sources (natural scenes, books, magazines, papers, reports, slides, exam papers, notes, newspapers) and vision-centric sources (charts, webpages, icons, geometry, circuits, molecules) with code-level representations.
• Stage 1 (SFT): direct mixing of data from all 8 domains to learn diverse visual patterns and build a unified representation space.
• Stage 2 (RL): domain-specific rewards (rule-based for text-centric, visual-fidelity for vision-centric) to resolve domain conflicts.
• Model: built on Qwen3-VL 4B (vision encoder → projector → language model), producing unified character-level (text-centric) and code-level (vision-centric) recognition.
Q1
1. What distinguishes OCRVerse's two-stage training methodology from traditional OCR approaches?
It uses supervised fine-tuning (SFT) followed by reinforcement learning (RL) with domain-specific reward strategies
It employs only pipeline-based methods with separate detection and recognition modules
It relies exclusively on pre-trained language models without any vision components
Q2
2. Which of the following best describes the 'vision-centric OCR' capability that OCRVerse introduces?
Recognizing handwritten text in natural scene images with complex backgrounds
Converting charts, web pages, and scientific plots into executable code representations
Extracting text from scanned PDF documents with multiple columns
Q3
3. Despite having only 4B parameters, OCRVerse achieves competitive performance compared to models with up to how many parameters?
14B parameters (matching InternVL3-14B performance)
38B parameters (comparable to InternVL3.5-38B)
72B parameters (matching Qwen2.5-VL-72B on certain tasks)