2026-01-05 Papers


Paper 1

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

Published: 2026-01-01

Link: http://arxiv.org/pdf/2601.00393

1. 📘 Topic and Domain: 4D world modeling using monocular videos for novel view synthesis and video generation.
2. 💡 Previous Research and New Ideas: Builds on VGGT and 4D Gaussian Splatting approaches, proposing a new scalable pipeline for processing in-the-wild monocular videos through feed-forward reconstruction and online degradation simulation.
3. ❓ Problem: Limited scalability of current 4D world modeling methods due to dependence on specialized multi-view data or cumbersome pre-processing.
4. 🛠️ Methods: Uses pose-free feed-forward 4D Gaussian reconstruction with bidirectional motion modeling, combined with online monocular degradation simulation and video diffusion generation.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on reconstruction benchmarks such as VRNeRF and ScanNet++, and on generation tasks against baselines such as TrajectoryCrafter and ReCamMaster, with improved efficiency through sparse key-frame sampling.
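The bidirectional motion modeling above plausibly enables temporal Gaussian interpolation between consecutive timestamps: each keyframe's Gaussians carry a forward and a backward motion offset, and a query time blends the two predictions. The sketch below is a hypothetical illustration of that idea for a single Gaussian center; the summary does not give NeoVerse's actual parameterization, so the function name and the linear blend are assumptions.

```python
def interp_center(mu_t0, fwd_t0, mu_t1, bwd_t1, alpha):
    """Blend forward and backward motion predictions for one Gaussian center.

    mu_t0 / mu_t1: 3D centers at keyframes t0 and t1; fwd_t0: offset t0->t1
    predicted at t0; bwd_t1: offset t1->t0 predicted at t1; alpha in [0, 1]
    is the normalized query time between the keyframes.
    """
    # prediction reached by moving forward from the earlier keyframe
    from_t0 = [m + alpha * f for m, f in zip(mu_t0, fwd_t0)]
    # prediction reached by moving backward from the later keyframe
    from_t1 = [m + (1.0 - alpha) * b for m, b in zip(mu_t1, bwd_t1)]
    # weight each prediction by proximity to its keyframe
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(from_t0, from_t1)]
```

At alpha = 0 this reduces to the t0 center and at alpha = 1 to the t1 center, so the interpolation is consistent at the keyframes by construction.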


Pipeline overview (from the paper's flow chart):
• Stage 1, reconstruction training: static & dynamic 3D datasets → VGGT backbone → bidirectional motion → 4D Gaussians + camera parameters.
• Stage 2, generation training: 1M monocular videos → sparse key frames → on-the-fly reconstruction → novel-trajectory degradation simulation (Gaussian culling, average geometry filter, flying-edge simulation) → degraded renderings → video diffusion model → generated video.
• Inference: input video → 4DGS reconstruction → global motion tracking → temporal aggregation → novel view rendering → video generation along a novel trajectory.
• Applications: video stabilization, video super-resolution, 3D tracking, video editing, image-to-world, background extraction, multi-view generation.
• Highlights: pose-free reconstruction, bidirectional motion modeling, online training pipeline scaling to 1M+ monocular videos, state-of-the-art reconstruction and generation quality.
Q1. What is the main scalability limitation that NeoVerse aims to address?
• Limited computational resources for real-time rendering
• Dependence on expensive multi-view 4D data and offline pre-processing
• Insufficient storage capacity for large video datasets
Q2. Which novel technique does NeoVerse introduce for handling degraded renderings?
• Deep learning-based super-resolution
• Average Geometry Filter for flying-edge-pixel simulation
• Neural network denoising
Q3. What unique advantage does NeoVerse's bidirectional motion modeling provide?
• It enables temporal Gaussian interpolation between consecutive timestamps
• It reduces the overall computational complexity
• It improves the color accuracy of rendered images

Paper 2

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Published: 2026-01-02

Link: http://arxiv.org/pdf/2601.00664

1. 📘 Topic and Domain: Real-time interactive head avatar generation for natural two-way conversations, in the domain of computer vision and human-computer interaction.
2. 💡 Previous Research and New Ideas: Builds on diffusion models and talking-head generation research, proposing a new Avatar Forcing framework that enables real-time bidirectional interaction, in contrast to previous one-way systems.
3. ❓ Problem: Existing avatar systems lack true interactive communication, with challenges in real-time processing of user inputs and generating natural expressive reactions.
4. 🛠️ Methods: Uses causal diffusion forcing in motion latent space to generate real-time avatar responses, with a dual motion encoder for multimodal inputs and preference optimization to enhance expressiveness.
5. 📊 Results and Evaluation: Achieves ~500ms latency for real-time interaction, with over 80% preference over baselines in human evaluations of naturalness and responsiveness, and significantly improved motion-expressiveness metrics.
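The causal generation described above relies on blockwise causal attention with a small look-ahead: frames within a block attend to each other and to all earlier blocks, with visibility extended a few frames ahead. The sketch below builds such a boolean mask; it is a minimal illustration of the masking pattern, not Avatar Forcing's actual implementation, and the `lookahead` parameterization is an assumption.

```python
def blockwise_causal_mask(n_frames, block_size, lookahead=0):
    """mask[q][k] is True where query frame q may attend to key frame k.

    Frames in the same block attend to each other and to all earlier
    blocks; `lookahead` extends visibility a few frames past the block,
    mirroring the look-ahead mechanism described in the paper summary.
    """
    mask = []
    for q in range(n_frames):
        # last visible key index: end of q's block, plus the look-ahead
        limit = ((q // block_size) + 1) * block_size + lookahead
        mask.append([k < min(limit, n_frames) for k in range(n_frames)])
    return mask
```

Because visibility only ever grows with q, previously computed key/value tensors stay valid as generation advances, which is what makes KV caching applicable here.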


Pipeline overview (from the paper's flow chart):
• Inputs: user audio (a_u), user motion (m_u), avatar audio (a), and avatar image (S), combined via a motion latent auto-encoder (z = z_S + m_S) and a dual motion encoder with cross-attention.
• Generator: causal DFoT motion generator with diffusion forcing, blockwise causal look-ahead attention, and KV caching; a motion latent decoder produces real-time video at ~500ms latency.
• Training: diffusion forcing loss L_DF = ||v_θ − v_target|| plus direct preference optimization (preferred: ground-truth motion; less preferred: audio-only).
• Evaluation metrics: reactiveness (rPCC), motion richness (SID, Var), visual quality (FID, FVD), lip sync (LSE-D, LSE-C), human preference (80%+).
• Applications: virtual meetings, interactive AI, content creation, virtual avatars, education.
Q1. What is the main technical innovation that allows Avatar Forcing to achieve real-time interaction?
• Using a blockwise causal diffusion forcing framework
• Implementing a new type of neural network architecture
• Reducing the video resolution during processing
Q2. How does Avatar Forcing improve the expressiveness of avatar reactions without additional labeled data?
• By using pre-trained emotion recognition models
• Through direct preference optimization with synthetic losing samples
• By copying expressions from a database of human reactions
Q3. What is the approximate latency achieved by Avatar Forcing for real-time interaction?
• 100 milliseconds
• 500 milliseconds
• 2000 milliseconds

Paper 3

Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Published: 2025-12-30

Link: http://arxiv.org/pdf/2512.24615

1. 📘 Topic and Domain: The paper presents Youtu-Agent, a modular framework for automated generation and continuous optimization of Large Language Model (LLM) agents.
2. 💡 Previous Research and New Ideas: Based on existing agent frameworks like MetaGPT and AutoGen that focus on multi-agent collaboration, this paper introduces automated agent generation and continuous learning capabilities through a novel hybrid policy optimization system.
3. ❓ Problem: The paper addresses two main challenges in LLM agent development: high configuration costs requiring extensive manual effort in tool integration and prompt engineering, and static capabilities that prevent agents from adapting to dynamic environments.
4. 🛠️ Methods: The paper implements a three-layer architecture (Environment, Tools, Agent) with two generation paradigms (Workflow and Meta-Agent modes), plus two optimization components: Agent Practice for experience-based improvement and Agent RL for reinforcement learning.
5. 📊 Results and Evaluation: The framework achieved 71.47% accuracy on WebWalkerQA and 72.8% on GAIA using open-source models, demonstrated an 81% tool-synthesis success rate, improved AIME performance by +2.7% and +5.4% through the Practice module, and achieved a 40% training speedup with steady performance gains through the RL module.
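Workflow mode's fixed pipeline (intent clarification → tool retrieval & synthesis → prompt engineering → configuration assembly) can be sketched as plain functions ending in a YAML-style config structure. Everything below is hypothetical: the function names, the keyword-matching tool retrieval, and the config fields are stand-ins to illustrate the pipeline shape, not Youtu-Agent's actual API.

```python
def clarify_intent(task):
    # 1. Intent clarification (stand-in: normalize the user request)
    return {"goal": task.strip().lower()}

def retrieve_tools(intent, registry):
    # 2. Tool retrieval (stand-in: keyword match over tool descriptions;
    # the real system also synthesizes missing tools)
    words = intent["goal"].split()
    return [name for name, desc in registry.items()
            if any(w in desc.split() for w in words)]

def assemble_config(intent, tools):
    # 3-4. Prompt engineering and configuration assembly into a
    # YAML-style structure (shown here as a plain dict)
    return {"agent": {
        "role": "LLM planner/executor",
        "prompt": f"You are an agent whose goal is: {intent['goal']}",
        "tools": tools,
    }}

registry = {"browser": "search the web", "shell": "run os shell commands"}
intent = clarify_intent("Search the web for new papers")
config = assemble_config(intent, retrieve_tools(intent, registry))
```

Meta-Agent mode would replace this fixed sequence with an architect agent that decides at runtime which of these steps to invoke (and whether to ask the user), which is exactly the distinction the paper draws between the two generation paradigms.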


Framework overview (from the paper's flow chart):
• Architecture: three layers: Environment (browser, OS shell, code execution); Tools (atomic & composite operations); Agent (LLM planner/executor), configured through a structured YAML system supporting both human and automated generation.
• Automated generation: Workflow mode (1. intent clarification, 2. tool retrieval & synthesis, 3. prompt engineering, 4. configuration assembly) and Meta-Agent mode (an architect agent dynamically plans using search_tool, create_tool, ask_user, create_config).
• Continuous optimization: Agent Practice (training-free GRPO with multiple rollouts, an LLM evaluator, experience accumulation, and textual LoRA injection) and Agent RL (end-to-end training; RESTful API and Ray for concurrency; entropy control and GRPO for stability; Agent-Lightning integration for distributed training).
• Reported results: WebWalkerQA 71.47%, GAIA 72.8%, tool synthesis 81%+ success, task completion 68.75%, AIME +2.7%/+5.4%, 40% RL training speedup.
• Application: Tip desktop assistant (Youtu-Agent integration, proactive intent detection, GUI agent automation, on-device privacy).
Q1. What is the main cost-effective innovation of Youtu-Agent's Practice module compared to traditional RL approaches?
• It requires only 100 training samples and $18 in costs
• It uses proprietary APIs for faster training
• It needs extensive GPU clusters for computation
Q2. Which component is NOT one of the three hierarchical layers in Youtu-Agent's system architecture?
• Environment Layer
• Optimization Layer
• Tools Layer
Q3. In the automated generation mechanism, what is the key difference between Workflow mode and Meta-Agent mode?
• Workflow mode is fully automated while Meta-Agent requires manual input
• Workflow mode follows a fixed pipeline while Meta-Agent can dynamically plan the generation process
• Workflow mode is for complex tasks while Meta-Agent is for simple ones