2025-12-05 Papers


Paper 1

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.04677

1. 📘 Topic and Domain: Real-time audio-driven avatar video generation with infinite-length capability using diffusion models in computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Building on prior video diffusion models and Distribution Matching Distillation (DMD), the paper introduces Timestep-forcing Pipeline Parallelism (TPP) and a Rolling Sink Frame Mechanism (RSFM) for streaming generation.
3. ❓ Problem: Addresses two key challenges in avatar generation: achieving real-time inference with large diffusion models while maintaining high fidelity, and ensuring long-term consistency in infinite-length video generation.
4. 🛠️ Methods: Implements TPP for parallel processing across GPUs, RSFM for maintaining visual consistency, and Self-Forcing Distribution Matching Distillation for model training, using a 14B-parameter diffusion model.
5. 📊 Results and Evaluation: Achieves 20 FPS on 5 H800 GPUs while maintaining high visual quality, outperforming existing methods in long-duration generation (up to 10,000 seconds) with better consistency and fidelity scores in metrics like ASE, IQA, and Sync-C.
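The TPP idea, one pipeline stage per fixed denoising timestep so that a finished frame exits every cycle once the pipeline is full, can be sketched with a toy simulation. `denoise_step`, the list-based "latents", and the stage count here are illustrative stand-ins, not the paper's implementation:

```python
# Toy simulation of Timestep-forcing Pipeline Parallelism (TPP):
# each "GPU" (pipeline stage) is pinned to one denoising timestep,
# and frames stream through the stages like an assembly line.

def denoise_step(latent, t):
    # Stand-in for one diffusion denoising step at timestep t:
    # here we just record which timestep touched the latent.
    return latent + [t]

def tpp_stream(frames, num_steps):
    stages = [None] * num_steps   # stage i holds the frame currently at timestep i
    finished = []
    pending = list(frames)
    while pending or any(s is not None for s in stages):
        # Frames leaving the last stage are fully denoised.
        if stages[-1] is not None:
            finished.append(stages[-1])
        # Advance the pipeline from the last stage backwards so each
        # frame moves exactly one stage (one timestep) per cycle.
        for i in range(num_steps - 1, 0, -1):
            stages[i] = denoise_step(stages[i - 1], i) if stages[i - 1] is not None else None
        stages[0] = denoise_step([pending.pop(0)], 0) if pending else None
    return finished

out = tpp_stream([1, 2, 3], num_steps=4)
```

With 4 stages and 3 frames, each frame passes through timesteps 0 to 3 in order while several frames are in flight at once, mirroring how each GPU stays pinned to one timestep instead of running the whole sequential denoising chain.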

[Figure: Live Avatar training and inference framework]
• Stage 1 (Diffusion Forcing pretraining): block-wise noise, causal attention mask, flow-matching loss on a 14B-parameter MM-DiT model
• Stage 2 (Self-Forcing Distribution Matching Distillation): real and fake score models, a causal generator with KV cache and noise injection, and a DMD loss that minimizes distribution divergence
• Timestep-forcing Pipeline Parallelism (TPP): each of five GPUs handles one fixed timestep, so denoising proceeds in parallel across devices with minimal communication overhead, yielding 20 FPS real-time streaming
• Rolling Sink Frame Mechanism (RSFM): an adaptive attention sink (the sink is replaced with the first generated frame) plus Rolling RoPE for dynamic position alignment; prevents identity drift, reduces color artifacts, and maintains temporal coherence
• Key innovations: algorithm-system co-design unifying training and inference optimization; the timestep-forcing pipeline breaks the sequential diffusion bottleneck; the rolling sink frame preserves identity across infinite-length (10,000+ second) generation
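A minimal sketch of the rolling-sink idea, assuming the mechanism keeps one fixed sink slot (filled by the first generated frame) plus a sliding window of recent frames, and re-indexes cached positions contiguously in the spirit of Rolling RoPE. The class and method names are hypothetical, not the paper's API:

```python
from collections import deque

class RollingSinkCache:
    """Toy sketch of a Rolling Sink Frame Mechanism (names hypothetical)."""

    def __init__(self, window: int):
        self.sink = None                      # sink frame, kept at position 0
        self.window = deque(maxlen=window)    # most recent frame features

    def add(self, frame):
        if self.sink is None:
            # Replace the attention sink with the first generated frame,
            # anchoring identity for all later attention queries.
            self.sink = frame
        else:
            self.window.append(frame)         # old frames roll out of the window

    def context(self):
        # Rolling-RoPE-style re-indexing: cached frames get contiguous
        # positions regardless of how much absolute time has elapsed.
        frames = [self.sink] + list(self.window)
        return [(pos, f) for pos, f in enumerate(frames)]
```

Because the sink never leaves position 0 while the window rolls, attention always sees the identity-defining first frame plus fresh temporal context, which is the consistency benefit the figure attributes to RSFM.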
Q1
1. What is the main innovation of Timestep-forcing Pipeline Parallelism (TPP) in this paper?
It reduces the model size to improve speed
It assigns each GPU a fixed timestep to break sequential bottleneck
It compresses video frames to save memory
Q2
2. What was the maximum length of video generation demonstrated in the paper's experiments?
1000 seconds
5000 seconds
10000 seconds
Q3
3. Why does the paper use Adaptive Attention Sink (AAS) in their approach?
To increase the generation speed
To reduce memory usage
To prevent distribution drift and maintain visual consistency

Paper 2

Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.04987

1. 📘 Topic and Domain: The paper focuses on training large language models (LLMs) for autonomous agent capabilities through a unified ecosystem called Nex-N1, in the domain of artificial intelligence and agent systems.
2. 💡 Previous Research and New Ideas: The paper builds on previous research in LLM agent frameworks and the ReAct paradigm, proposing a unified ecosystem (NexAU, NexA4A, NexGAP) that automatically generates diverse agent environments and training data at scale.
3. ❓ Problem: The paper addresses the lack of scalable infrastructure for constructing high-quality interaction environments needed to train LLMs as effective autonomous agents rather than passive responders.
4. 🛠️ Methods: The authors developed a three-part system: NexAU (a modular runtime for agent frameworks), NexA4A (automatic generator of agents and frameworks), and NexGAP (pipeline for generating agentic training data), which together create diverse and complex interactive environments.
5. 📊 Results and Evaluation: The Nex-N1 model outperformed other open-source models on multiple benchmarks including τ2-bench, GAIA 2, and SWE-bench, while showing competitive performance against proprietary models like GPT-5 in tool use and agentic tasks.
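The trajectories such an ecosystem collects follow the ReAct-style think-act-observe loop mentioned above; a minimal sketch, in which the policy signature, tool registry, and trajectory record format are all assumptions for illustration:

```python
def run_agent(policy, tools, task, max_turns=8):
    """Toy ReAct loop: the policy proposes a thought and an action,
    the action runs against a tool registry, and the observation feeds
    the next turn. All names here are illustrative."""
    trajectory = []
    observation = task
    for _ in range(max_turns):
        thought, action, args = policy(observation, trajectory)   # think
        if action == "finish":
            trajectory.append({"thought": thought, "action": action, "result": args})
            break
        result = tools[action](*args)                             # act
        trajectory.append({"thought": thought, "action": action, "result": result})
        observation = result                                      # observe
    return trajectory

# Toy usage: a two-turn episode with a single arithmetic tool.
def toy_policy(obs, traj):
    if not traj:
        return ("add the numbers", "add", (2, 3))
    return ("report the sum", "finish", traj[-1]["result"])

episode = run_agent(toy_policy, {"add": lambda a, b: a + b}, "what is 2+3?")
```

Each recorded turn is one (thought, action, result) triple, which is the kind of interactive trajectory a data pipeline like NexGAP would harvest at scale across many generated environments.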

[Figure: Nex-N1 training workflow]
• NexAU (Agent Universe): a modular runtime with a recursive architecture and unified interface
• NexA4A (Agent for Agent): automatically builds agent frameworks and agents
• NexGAP (General Agent Pipeline): real MCP tools, query synthesis, and quality control
• Environment generation: 200+ agent frameworks ranging from 1 to 34 nodes in complexity
• Data construction: search-enhanced generation and supervisor tool feedback yield diverse interactive trajectories in multiple tool-call formats
• Evaluation: benchmarks (SWE-bench, τ2-bench, GAIA 2, BFCL v4), real-world tasks (agentic coding, web development, deep research, poster generation), and framework robustness (OpenHands, Claude Code, Terminus-2, cross-framework)
• Key results: outperforms open-source models, competitive with GPT-5, superior tool use, strong generalization; scaling dimensions are complexity, diversity, and fidelity
Q1
1. What is the main innovation of Nex-N1 compared to traditional LLM training approaches?
It uses larger training datasets
It automatically generates diverse agent environments for training
It focuses only on coding tasks
Q2
2. Which component of the Nex ecosystem is responsible for generating agent frameworks from natural language specifications?
NexGAP
NexAU
NexA4A
Q3
3. What unique capability does the Paper2Poster Agent demonstrate?
It can only generate English posters
It can seamlessly switch between English and Chinese versions of academic posters
It only works with conference logos

Paper 3

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.05111

1. 📘 Topic and Domain: The paper presents ARM-Thinker, a multimodal reward model that incorporates tool use and visual reasoning capabilities for evaluating AI model outputs.
2. 💡 Previous Research and New Ideas: Building on existing reward models and tool-use frameworks, the paper introduces an agentic approach in which the reward model actively uses tools to verify and ground its judgments rather than making passive assessments.
3. ❓ Problem: The paper addresses the limitations of current reward models that lack the ability to verify fine details, cross-reference evidence, and use tools for validation, leading to hallucination and weak visual grounding.
4. 🛠️ Methods: The authors develop a multi-stage training pipeline combining supervised fine-tuning and reinforcement learning, along with a new benchmark ARMBench-VL to evaluate tool-assisted reward modeling capabilities.
5. 📊 Results and Evaluation: ARM-Thinker achieved significant improvements over baselines: +16.2% on reward modeling benchmarks, +9.6% on tool-use tasks, and +4.2% on general reasoning benchmarks, demonstrating the effectiveness of agentic capabilities in reward models.
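The two GRPO stages use the staged reward signals shown in the paper's workflow figure (R_tool = R_f + R_try, then R_acc = R_f + R_a + R_succ). A sketch with unit-weight binary components, since the exact component values and weights are not given here:

```python
def stage1_reward(format_ok: bool, tool_called: bool) -> float:
    # Stage 1 (tool-call encouragement): R_tool = R_f + R_try
    r_f = 1.0 if format_ok else 0.0      # trajectory is well-formed
    r_try = 1.0 if tool_called else 0.0  # reward for attempting a tool call
    return r_f + r_try

def stage2_reward(format_ok: bool, answer_correct: bool, tool_succeeded: bool) -> float:
    # Stage 2 (accuracy refinement): R_acc = R_f + R_a + R_succ
    r_f = 1.0 if format_ok else 0.0
    r_a = 1.0 if answer_correct else 0.0     # judgment matches the preference label
    r_succ = 1.0 if tool_succeeded else 0.0  # tool call executed successfully
    return r_f + r_a + r_succ
```

Stage 1 pays out for merely attempting tool calls, which promotes exploration; Stage 2 shifts credit toward correct judgments and successful tool executions once tool use is established.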

[Figure: ARM-Thinker workflow]
• Data gathering: LLaVA-Critic, DeepEyes (image tools), MM-IFEngine (text tools), and MP-DocVQA (document tools); GPT-4o-mini generation, difficulty filtration, and preference pairs (r⁺, r⁻)
• Multi-stage training: SFT cold start fine-tunes Qwen2.5-VL-7B on CoT trajectory data to initialize tool behaviors; GRPO Stage 1 encourages tool calls (R_tool = R_f + R_try); GRPO Stage 2 refines accuracy (R_acc = R_f + R_a + R_succ); rollouts form trajectory groups G = {(τᵢ, aᵢ)}ᵢ₌₁ⁿ
• Agent loop: think-act-observe over multimodal tools (image crop & zoom, document page retrieval, instruction checking), backed by an indexed memory map (texts_map + imgs_map) for lightweight context storage
• ARMBench-VL evaluation: fine-grained perception (550 samples), long-document QA (460 samples), instruction following (489 samples)
• Results: +16.2% on reward modeling, +9.6% on tool use, +4.2% on reasoning; judgments are evidence-grounded, verifiable, and interpretable, turning static reward scoring into active evidence gathering via think-act-verify loops
Q1
1. What is the main innovation of ARM-Thinker compared to traditional reward models?
It uses a larger model architecture with more parameters
It actively uses tools to verify and ground its judgments
It only focuses on text-based reward modeling
Q2
2. Which stage of ARM-Thinker's training pipeline comes first?
Group Relative Policy Optimization (GRPO)
Tool Call Encouragement
Supervised Fine-Tuning (SFT) with Cold Start
Q3
3. In the ARMBench-VL benchmark, which task does NOT require tool use?
Fine-grained Perception
Multimodal Long Document QA
Basic Caption Generation