2025-12-01 Papers


Paper 1

AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Published: 2025-11-28

Link: http://arxiv.org/pdf/2511.23475

1. 📘 Topic and Domain: Audio-driven multi-person talking video generation with emphasis on natural interactions between characters in generated videos.
2. 💡 Previous Research and New Ideas: Builds on prior video diffusion models and single-person talking-head generation, proposing a novel identity-aware attention mechanism that handles an arbitrary number of identities with minimal multi-person training data.
3. ❓ Problem: Existing multi-person video generation methods require massive multi-person training data and struggle to create natural interactions between characters.
4. 🛠️ Methods: Introduces Audio-Face Cross Attention (AFCA) architecture for processing multiple audio-identity pairs, uses two-stage training with single-person data concatenation followed by multi-person data refinement.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance in lip synchronization, visual quality, and natural interactivity using only 12 hours of multi-person training data, evaluated using a new interactivity metric and benchmark dataset.
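The extensible, identity-aware attention update can be sketched in a few lines. This is a hypothetical single-head NumPy illustration, not the paper's implementation: the token shapes, the face-mask gating, and `cross_attention` itself are assumptions. The key point is that the loop adds one masked cross-attention term per audio-identity pair, so any number of identities is supported.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention, for illustration only.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def afca_update(video_tokens, audio_streams, face_masks):
    # H' = H + AFCA_1 + AFCA_2 + ... + AFCA_n: one cross-attention pass
    # per audio-identity pair, each gated by that identity's face mask so
    # the update stays local to that speaker's region (no cross-talk).
    # The loop runs over however many identities are supplied.
    h = video_tokens.copy()
    for audio_tokens, mask in zip(audio_streams, face_masks):
        h = h + mask[:, None] * cross_attention(video_tokens, audio_tokens, audio_tokens)
    return h
```

Because each identity contributes an additive, face-masked term, adding a speaker requires no architectural change, only another (audio, mask) pair.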

AnyTalker method workflow (recovered from the paper's overview figure):
- Data preparation: ~1000 h of single-person video, concatenated multi-person data, and ~12 h of real multi-person video, quality-filtered with InsightFace, SyncNet, and optical flow.
- Core architecture: a 3D VAE yields video tokens and Wav2Vec2 yields audio tokens; Audio-Face Cross Attention (AFCA) performs extensible multi-stream processing over audio-identity pairs.
- Two-stage training: Stage 1 learns talking patterns from 50% single-person and 50% concatenated data; Stage 2 refines interactivity on the ~12 h of real multi-person data. Learning rate 2×10⁻⁵ → 5×10⁻⁶, AdamW optimizer, 32 NVIDIA H200 GPUs.
- AFCA details: audio and face features are concatenated and fed through multi-head attention with a face mask; identities are processed iteratively, H' = H + AFCA₁ + AFCA₂ + … + AFCAₙ, supporting an arbitrary number of identities. A temporal attention mask aligns 4 audio tokens to 1 video token.
- Evaluation framework: the InteractiveEyes benchmark measures interactivity via eye motion during listening, alongside traditional FID, FVD, Sync-C, and ID metrics. Over a listening segment S, Motion = (1/(|S|−1)) × Σⱼ |E_{i,j+1} − E_{i,j}|, and Interactivity = (L₂×Motion_{L₂} + L₃×Motion_{L₃}) / (L₂+L₃); this is the first quantitative metric for multi-person interactivity.
- Key results and contributions: scalability to an arbitrary number of identities via an extensible architecture; data efficiency (only 12 h of multi-person data versus 100-1000 h in prior work); SOTA interactivity scores with competitive lip sync; generalization across real humans, AIGC characters, and animals.
- Technical innovations: the identity-aware, iteratively applied AFCA layer; horizontal-concatenation data augmentation for synthetic multi-person scenes; face masking for spatial localization that prevents cross-talk; the eye-based interactivity metric for listening behavior; and two-stage pattern-then-interaction learning with progressive refinement.
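The eye-motion interactivity metric lends itself to a short numerical sketch. How keypoints are aggregated within a frame and how listening segments are detected are assumptions here; the structure (mean frame-to-frame displacement per segment, length-weighted across segments) follows the formulas in the figure.

```python
import numpy as np

def eye_motion(eye_kpts):
    # eye_kpts: (frames, keypoints, 2) eye keypoints over one listening
    # segment S. Motion = 1/(|S| - 1) * sum_j |E_{j+1} - E_j|, i.e. the
    # mean frame-to-frame displacement (averaged over keypoints here).
    if len(eye_kpts) < 2:
        return 0.0
    diffs = np.diff(eye_kpts, axis=0)
    return float(np.linalg.norm(diffs, axis=-1).mean())

def interactivity(listening_segments):
    # Length-weighted average over all listeners' listening segments,
    # generalizing Interactivity = (L2*Motion_L2 + L3*Motion_L3)/(L2 + L3)
    # from the two-listener case in the figure.
    total = sum(len(s) for s in listening_segments)
    return sum(len(s) * eye_motion(s) for s in listening_segments) / total
```

A character who sits perfectly still while listening scores zero, which is exactly the failure mode the metric is designed to expose.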
Q1. What is the main innovation in AnyTalker's training approach that reduces data requirements?
- Using only real multi-person videos for training
- Horizontally concatenating single-person videos to simulate multi-person scenarios
- Using AI-generated synthetic videos for training

Q2. How does AnyTalker evaluate the interactivity between characters in generated videos?
- By measuring the audio synchronization quality
- By tracking changes in facial expressions
- By measuring eye keypoint motion during listening periods

Q3. What unique capability does the Audio-Face Cross Attention (AFCA) architecture provide?
- It can only handle two speakers at a time
- It can scale to handle an arbitrary number of identities
- It works only with pre-recorded voices

Paper 2

Vision Bridge Transformer at Scale

Published: 2025-11-28

Link: http://arxiv.org/pdf/2511.23199

1. 📘 Topic and Domain: Vision Bridge Transformer (ViBT) for large-scale vision translation tasks in computer vision, focusing on conditional image and video generation.
2. 💡 Previous Research and New Ideas: Based on Brownian Bridge Models and probability path modeling, proposing the first large-scale (20B parameters) implementation of Bridge Models with a new stabilized velocity-matching objective.
3. ❓ Problem: Addressing the inefficiency and unnatural modeling of traditional noise-to-vision approaches in conditional generation tasks by developing a more direct and efficient data-to-data translation paradigm.
4. 🛠️ Methods: Implements a transformer-based architecture with variance-stabilized velocity matching objective and variance-corrected sampling strategy, trained on paired source-target data in latent space.
5. 📊 Results and Evaluation: Achieved competitive results with traditional conditional diffusion methods while being more efficient, demonstrated strong performance across various tasks including image editing, video stylization, and depth-to-video synthesis, evaluated using multiple metrics like NIQE, MUSIQ, and CLIP Score.

ViBT method workflow (recovered from the paper's overview figure):
- Data and encoding: paired samples (x₀, x₁) drawn from the joint source/target distribution are encoded into latent space with a VAE.
- Brownian bridge: the intermediate state is x_t = (1−t)·x₀ + t·x₁ + √(t(1−t))·ε.
- Training: sample t ~ U(0, 1), compute the velocity target, and update the parameters θ under the stabilized velocity-matching objective, normalized by α(x₀, x₁, t).
- Inference: initialize at the source x₀ and integrate with Euler-Maruyama discretization plus variance-corrected sampling.
- Key innovation, variance-stabilized velocity matching: α²(x₀, x₁, t) = 1 + t·D / [(1−t)·||x₁−x₀||²], which prevents divergence at t→1 and balances the loss across timesteps.
- Tasks: instruction-based image editing (20B parameters, LoRA fine-tuning); video stylization (style transfer with motion preservation and temporal coherence); video translation such as depth-to-video (1.3B parameters, full fine-tuning); plus video colorization, frame interpolation, and image coloring.
- Architecture and training details: the image model builds on a Qwen-20B base and the video model on a Wan 2.1-1.3B base; Prodigy optimizer (lr = 1), 20k training iterations, and a reported 2-4× speedup.
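The bridge construction and stabilization factor can be sketched numerically. The bridge state and α² follow the formulas in the figure; the velocity target (x₁ − x_t)/(1 − t) used below is the standard Brownian-bridge drift and is an assumption — the paper's exact parameterization may differ.

```python
import numpy as np

def bridge_state(x0, x1, t, rng):
    # Brownian-bridge intermediate state between source x0 and target x1:
    # x_t = (1 - t) x0 + t x1 + sqrt(t (1 - t)) eps
    eps = rng.standard_normal(x0.shape)
    return (1 - t) * x0 + t * x1 + np.sqrt(t * (1 - t)) * eps

def alpha_squared(x0, x1, t):
    # Variance-stabilizing factor from the figure:
    # alpha^2 = 1 + t D / ((1 - t) ||x1 - x0||^2), D = dimensionality.
    # Keeps the loss finite as t -> 1 and balances it across timesteps.
    return 1 + t * x0.size / ((1 - t) * np.sum((x1 - x0) ** 2))

def velocity_matching_loss(pred_v, x1, x_t, x0, t):
    # Bridge drift toward the target (assumed parameterization):
    target = (x1 - x_t) / (1 - t)
    # Stabilized objective: squared error scaled down by alpha^2.
    return np.mean((pred_v - target) ** 2) / alpha_squared(x0, x1, t)
```

Note how the raw target (x₁ − x_t)/(1 − t) blows up as t → 1 while α² grows at the same rate, which is what keeps the per-timestep loss contributions balanced.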
Q1. What is the main advantage of ViBT's data-to-data translation paradigm compared to traditional noise-to-vision approaches?
- It requires less computational resources and memory
- It provides a more natural and direct path between source and target domains
- It generates higher quality images with better resolution

Q2. Why did the authors introduce the stabilized velocity-matching objective in ViBT?
- To increase the model's parameter count
- To reduce training time and memory usage
- To address numerical instability and balance loss contributions across timesteps

Q3. What is the scale of parameters used in ViBT for different tasks?
- 20B for both image and video tasks
- 20B for image tasks and 1.3B for video tasks
- 1.3B for image tasks and 20B for video tasks

Paper 3

REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

Published: 2025-11-27

Link: http://arxiv.org/pdf/2511.22625

1. 📘 Topic and Domain: The paper introduces REASONEDIT, an image editing model that enhances editing capabilities through reasoning mechanisms in computer vision and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on multimodal large language models (MLLMs) coupled with diffusion decoders for image editing, this paper proposes thinking and reflection mechanisms that enhance instruction understanding and editing accuracy.
3. ❓ Problem: The paper addresses the limitation of current image editing models that struggle with complex or abstract instructions due to frozen MLLM encoders during training.
4. 🛠️ Methods: The authors implement a multi-stage training strategy combining an MLLM as the Reasoner and a DiT as the Generator, using thinking pairs and reflection triples datasets to train the model's reasoning capabilities.
5. 📊 Results and Evaluation: The model achieved significant performance gains over baseline models, with ReasonEdit-S improving ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%), while ReasonEdit-Q showed improvements of ImgEdit (+2.8%), GEdit (+3.4%), and Kris (+6.1%).

ReasonEdit workflow (recovered from the paper's overview figure):
- Data construction: 200k thinking pairs (abstract → concrete instructions) and 180k reflection triples (input → generated → target images, in a 3:1:1 ratio).
- Multi-stage training: (1) reasoning learning, (2) edit learning, (3) unified tuning.
- Model architecture: an MLLM serves as the Reasoner and a DiT as the Generator.
- Inference pipeline (Thinking → Editing → Reflection loop): given a reference image and an abstract instruction, the MLLM converts the instruction into concrete commands; the DiT generates the edited image from them; a multi-round reflection stage assesses the result (assess → conclude → refine/success). On #Success the edited image is output; otherwise a refined editing instruction is generated and the loop repeats.
- Thinking mechanism: 200k curated abstract-to-concrete pairs leverage the MLLM's world knowledge to handle complex or ambiguous instructions, e.g. "symptoms of potassium" → "Make leaves yellow, desiccate tips".
- Reflection mechanism: a multi-round single-image pipeline (target describe → assess → conclude) outputs #Success, #Reflection, or #Failed, giving iterative self-correction with VIEScore-based quality assessment.
- Performance gains: ReasonEdit-S improves ImgEdit by +4.3%, GEdit by +4.7%, and Kris by +8.2%; state-of-the-art among open-source methods.
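The inference pipeline reduces to a simple control loop. In this sketch, `reasoner` and `generator` stand in for the MLLM Reasoner and DiT Generator; their method names (`think`, `edit`, `reflect`, `refine`) and the round limit are hypothetical, but the thinking → editing → reflection flow follows the paper's description.

```python
def reason_edit(image, abstract_instruction, reasoner, generator, max_rounds=3):
    # Thinking: the MLLM rewrites the abstract request as concrete commands.
    instruction = reasoner.think(image, abstract_instruction)
    edited = image
    for _ in range(max_rounds):
        # Editing: the DiT generates an edited image from the commands.
        edited = generator.edit(image, instruction)
        # Reflection: assess the result against the intended edit.
        verdict = reasoner.reflect(image, edited, abstract_instruction)
        if verdict == "#Success":
            return edited
        # Refinement: produce a revised instruction and try again.
        instruction = reasoner.refine(image, edited, abstract_instruction)
    return edited
```

The point of the loop is that the generator never sees the raw abstract instruction — only concrete commands that the reasoner keeps revising until reflection reports success or the round budget runs out.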
Q1. What is the main innovation that REASONEDIT introduces to improve image editing capabilities?
- A new type of diffusion decoder architecture
- Thinking and reflection mechanisms for reasoning enhancement
- A larger dataset of image-text pairs

Q2. In the multi-stage training strategy of REASONEDIT, what component remains frozen during the Edit Learning Stage?
- The DiT Generator
- The MLLM Reasoner
- Both components

Q3. What is the key limitation of current image editing models that REASONEDIT addresses?
- Poor image quality in generated outputs
- Slow processing speed during editing
- Difficulty handling complex or abstract instructions due to frozen MLLM encoders
Difficulty handling complex or abstract instructions due to frozen MLLM encoders