2025-12-05 Papers


Paper 1

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.04677

1. 📘 Topic and Domain: Real-time audio-driven avatar video generation with infinite-length capability using diffusion models in computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Building on prior video diffusion models and Distribution Matching Distillation (DMD), the paper introduces Timestep-forcing Pipeline Parallelism (TPP) and a Rolling Sink Frame Mechanism (RSFM) for streaming generation.
3. ❓ Problem: Addresses two key challenges in avatar generation: achieving real-time inference with large diffusion models while maintaining high fidelity, and ensuring long-term consistency in infinite-length video generation.
4. 🛠️ Methods: Implements TPP for parallel processing across GPUs, RSFM for maintaining visual consistency, and Self-Forcing Distribution Matching Distillation for model training, using a 14B-parameter diffusion model.
5. 📊 Results and Evaluation: Achieves 20 FPS on 5 H800 GPUs while maintaining high visual quality, outperforming existing methods in long-duration generation (up to 10,000 seconds) with better consistency and fidelity scores in metrics like ASE, IQA, and Sync-C.
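The TPP idea, one pipeline stage per fixed denoising timestep so that a finished frame exits every cycle once the pipeline is full, can be sketched with a toy simulation. `denoise_step`, the list-based "latents", and the stage count here are illustrative stand-ins, not the paper's implementation:

```python
# Toy simulation of Timestep-forcing Pipeline Parallelism (TPP):
# each "GPU" (pipeline stage) is pinned to one denoising timestep,
# and frames stream through the stages like an assembly line.

def denoise_step(latent, t):
    # Stand-in for one diffusion denoising step at timestep t:
    # here we just record which timestep touched the latent.
    return latent + [t]

def tpp_stream(frames, num_steps):
    stages = [None] * num_steps   # stage i holds the frame currently at timestep i
    finished = []
    pending = list(frames)
    while pending or any(s is not None for s in stages):
        # Frames leaving the last stage are fully denoised.
        if stages[-1] is not None:
            finished.append(stages[-1])
        # Advance the pipeline from the last stage backwards so each
        # frame moves exactly one stage (one timestep) per cycle.
        for i in range(num_steps - 1, 0, -1):
            stages[i] = denoise_step(stages[i - 1], i) if stages[i - 1] is not None else None
        stages[0] = denoise_step([pending.pop(0)], 0) if pending else None
    return finished

out = tpp_stream([1, 2, 3], num_steps=4)
```

With 4 stages and 3 frames, each frame passes through timesteps 0 to 3 in order while several frames are in flight at once, mirroring how each GPU stays pinned to one timestep instead of running the whole sequential denoising chain.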

[Figure: Live Avatar training and inference framework]
• Stage 1 (Diffusion Forcing pretraining): block-wise noise, causal attention mask, flow-matching loss on a 14B-parameter MM-DiT model
• Stage 2 (Self-Forcing Distribution Matching Distillation): real and fake score models, a causal generator with KV cache and noise injection, and a DMD loss that minimizes distribution divergence
• Timestep-forcing Pipeline Parallelism (TPP): each of five GPUs handles one fixed timestep, so denoising proceeds in parallel across devices with minimal communication overhead, yielding 20 FPS real-time streaming
• Rolling Sink Frame Mechanism (RSFM): an adaptive attention sink (the sink is replaced with the first generated frame) plus Rolling RoPE for dynamic position alignment; prevents identity drift, reduces color artifacts, and maintains temporal coherence
• Key innovations: algorithm-system co-design unifying training and inference optimization; the timestep-forcing pipeline breaks the sequential diffusion bottleneck; the rolling sink frame preserves identity across infinite-length (10,000+ second) generation
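A minimal sketch of the rolling-sink idea, assuming the mechanism keeps one fixed sink slot (filled by the first generated frame) plus a sliding window of recent frames, and re-indexes cached positions contiguously in the spirit of Rolling RoPE. The class and method names are hypothetical, not the paper's API:

```python
from collections import deque

class RollingSinkCache:
    """Toy sketch of a Rolling Sink Frame Mechanism (names hypothetical)."""

    def __init__(self, window: int):
        self.sink = None                      # sink frame, kept at position 0
        self.window = deque(maxlen=window)    # most recent frame features

    def add(self, frame):
        if self.sink is None:
            # Replace the attention sink with the first generated frame,
            # anchoring identity for all later attention queries.
            self.sink = frame
        else:
            self.window.append(frame)         # old frames roll out of the window

    def context(self):
        # Rolling-RoPE-style re-indexing: cached frames get contiguous
        # positions regardless of how much absolute time has elapsed.
        frames = [self.sink] + list(self.window)
        return [(pos, f) for pos, f in enumerate(frames)]
```

Because the sink never leaves position 0 while the window rolls, attention always sees the identity-defining first frame plus fresh temporal context, which is the consistency benefit the figure attributes to RSFM.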
Q1
1. What is the main innovation of Timestep-forcing Pipeline Parallelism (TPP) in this paper?
It reduces the model size to improve speed
It assigns each GPU a fixed timestep to break sequential bottleneck
It compresses video frames to save memory
Q2
2. What was the maximum length of video generation demonstrated in the paper's experiments?
1000 seconds
5000 seconds
10000 seconds
Q3
3. Why does the paper use Adaptive Attention Sink (AAS) in their approach?
To increase the generation speed
To reduce memory usage
To prevent distribution drift and maintain visual consistency

Paper 2

Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.04987

1. 📘 Topic and Domain: The paper focuses on training large language models (LLMs) for autonomous agent capabilities through a unified ecosystem called Nex-N1, in the domain of artificial intelligence and agent systems.
2. 💡 Previous Research and New Ideas: The paper builds on previous research in LLM agent frameworks and the ReAct paradigm, proposing a unified ecosystem (NexAU, NexA4A, NexGAP) that automatically generates diverse agent environments and training data at scale.
3. ❓ Problem: The paper addresses the lack of scalable infrastructure for constructing high-quality interaction environments needed to train LLMs as effective autonomous agents rather than passive responders.
4. 🛠️ Methods: The authors developed a three-part system: NexAU (a modular runtime for agent frameworks), NexA4A (automatic generator of agents and frameworks), and NexGAP (pipeline for generating agentic training data), which together create diverse and complex interactive environments.
5. 📊 Results and Evaluation: The Nex-N1 model outperformed other open-source models on multiple benchmarks including τ2-bench, GAIA 2, and SWE-bench, while showing competitive performance against proprietary models like GPT-5 in tool use and agentic tasks.
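The trajectories such an ecosystem collects follow the ReAct-style think-act-observe loop mentioned above; a minimal sketch, in which the policy signature, tool registry, and trajectory record format are all assumptions for illustration:

```python
def run_agent(policy, tools, task, max_turns=8):
    """Toy ReAct loop: the policy proposes a thought and an action,
    the action runs against a tool registry, and the observation feeds
    the next turn. All names here are illustrative."""
    trajectory = []
    observation = task
    for _ in range(max_turns):
        thought, action, args = policy(observation, trajectory)   # think
        if action == "finish":
            trajectory.append({"thought": thought, "action": action, "result": args})
            break
        result = tools[action](*args)                             # act
        trajectory.append({"thought": thought, "action": action, "result": result})
        observation = result                                      # observe
    return trajectory

# Toy usage: a two-turn episode with a single arithmetic tool.
def toy_policy(obs, traj):
    if not traj:
        return ("add the numbers", "add", (2, 3))
    return ("report the sum", "finish", traj[-1]["result"])

episode = run_agent(toy_policy, {"add": lambda a, b: a + b}, "what is 2+3?")
```

Each recorded turn is one (thought, action, result) triple, which is the kind of interactive trajectory a data pipeline like NexGAP would harvest at scale across many generated environments.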

[Figure: Nex-N1 training workflow]
• NexAU (Agent Universe): a modular runtime with a recursive architecture and unified interface
• NexA4A (Agent for Agent): automatically builds agent frameworks and agents
• NexGAP (General Agent Pipeline): real MCP tools, query synthesis, and quality control
• Environment generation: 200+ agent frameworks ranging from 1 to 34 nodes in complexity
• Data construction: search-enhanced generation and supervisor tool feedback yield diverse interactive trajectories in multiple tool-call formats
• Evaluation: benchmarks (SWE-bench, τ2-bench, GAIA 2, BFCL v4), real-world tasks (agentic coding, web development, deep research, poster generation), and framework robustness (OpenHands, Claude Code, Terminus-2, cross-framework)
• Key results: outperforms open-source models, competitive with GPT-5, superior tool use, strong generalization; scaling dimensions are complexity, diversity, and fidelity
Q1
1. What is the main innovation of Nex-N1 compared to traditional LLM training approaches?
It uses larger training datasets
It automatically generates diverse agent environments for training
It focuses only on coding tasks
Q2
2. Which component of the Nex ecosystem is responsible for generating agent frameworks from natural language specifications?
NexGAP
NexAU
NexA4A
Q3
3. What unique capability does the Paper2Poster Agent demonstrate?
It can only generate English posters
It can seamlessly switch between English and Chinese versions of academic posters
It only works with conference logos

Paper 3

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.05111

1. 📘 Topic and Domain: The paper presents ARM-Thinker, a multimodal reward model that incorporates tool use and visual reasoning capabilities for evaluating AI model outputs.
2. 💡 Previous Research and New Ideas: Building on existing reward models and tool-use frameworks, the paper introduces an agentic approach in which the reward model actively uses tools to verify and ground its judgments rather than making passive assessments.
3. ❓ Problem: The paper addresses the limitations of current reward models that lack the ability to verify fine details, cross-reference evidence, and use tools for validation, leading to hallucination and weak visual grounding.
4. 🛠️ Methods: The authors develop a multi-stage training pipeline combining supervised fine-tuning and reinforcement learning, along with a new benchmark ARMBench-VL to evaluate tool-assisted reward modeling capabilities.
5. 📊 Results and Evaluation: ARM-Thinker achieved significant improvements over baselines: +16.2% on reward modeling benchmarks, +9.6% on tool-use tasks, and +4.2% on general reasoning benchmarks, demonstrating the effectiveness of agentic capabilities in reward models.
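The two GRPO stages use the staged reward signals shown in the paper's workflow figure (R_tool = R_f + R_try, then R_acc = R_f + R_a + R_succ). A sketch with unit-weight binary components, since the exact component values and weights are not given here:

```python
def stage1_reward(format_ok: bool, tool_called: bool) -> float:
    # Stage 1 (tool-call encouragement): R_tool = R_f + R_try
    r_f = 1.0 if format_ok else 0.0      # trajectory is well-formed
    r_try = 1.0 if tool_called else 0.0  # reward for attempting a tool call
    return r_f + r_try

def stage2_reward(format_ok: bool, answer_correct: bool, tool_succeeded: bool) -> float:
    # Stage 2 (accuracy refinement): R_acc = R_f + R_a + R_succ
    r_f = 1.0 if format_ok else 0.0
    r_a = 1.0 if answer_correct else 0.0     # judgment matches the preference label
    r_succ = 1.0 if tool_succeeded else 0.0  # tool call executed successfully
    return r_f + r_a + r_succ
```

Stage 1 pays out for merely attempting tool calls, which promotes exploration; Stage 2 shifts credit toward correct judgments and successful tool executions once tool use is established.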

[Figure: ARM-Thinker workflow]
• Data gathering: LLaVA-Critic, DeepEyes (image tools), MM-IFEngine (text tools), and MP-DocVQA (document tools); GPT-4o-mini generation, difficulty filtration, and preference pairs (r⁺, r⁻)
• Multi-stage training: SFT cold start fine-tunes Qwen2.5-VL-7B on CoT trajectory data to initialize tool behaviors; GRPO Stage 1 encourages tool calls (R_tool = R_f + R_try); GRPO Stage 2 refines accuracy (R_acc = R_f + R_a + R_succ); rollouts form trajectory groups G = {(τᵢ, aᵢ)}ᵢ₌₁ⁿ
• Agent loop: think-act-observe over multimodal tools (image crop & zoom, document page retrieval, instruction checking), backed by an indexed memory map (texts_map + imgs_map) for lightweight context storage
• ARMBench-VL evaluation: fine-grained perception (550 samples), long-document QA (460 samples), instruction following (489 samples)
• Results: +16.2% on reward modeling, +9.6% on tool use, +4.2% on reasoning; judgments are evidence-grounded, verifiable, and interpretable, turning static reward scoring into active evidence gathering via think-act-verify loops
Q1
1. What is the main innovation of ARM-Thinker compared to traditional reward models?
It uses a larger model architecture with more parameters
It actively uses tools to verify and ground its judgments
It only focuses on text-based reward modeling
Q2
2. Which stage of ARM-Thinker's training pipeline comes first?
Group Relative Policy Optimization (GRPO)
Tool Call Encouragement
Supervised Fine-Tuning (SFT) with Cold Start
Q3
3. In the ARMBench-VL benchmark, which task does NOT require tool use?
Fine-grained Perception
Multimodal Long Document QA
Basic Caption Generation