2025-12-01 Papers


Paper 1

AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Published: 2025-11-28

Link: http://arxiv.org/pdf/2511.23475

1. 📘 Topic and Domain: Audio-driven multi-person talking video generation with emphasis on natural interactions between characters in generated videos.
2. 💡 Previous Research and New Ideas: Builds on prior video diffusion models and single-person talking-head generation, proposing a novel identity-aware attention mechanism that handles an arbitrary number of identities with minimal multi-person training data.
3. ❓ Problem: Existing multi-person video generation methods require massive multi-person training data and struggle to create natural interactions between characters.
4. 🛠️ Methods: Introduces Audio-Face Cross Attention (AFCA) architecture for processing multiple audio-identity pairs, uses two-stage training with single-person data concatenation followed by multi-person data refinement.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance in lip synchronization, visual quality, and natural interactivity using only 12 hours of multi-person training data, evaluated using a new interactivity metric and benchmark dataset.
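The extensible, identity-aware attention update can be sketched in a few lines. This is a hypothetical single-head NumPy illustration, not the paper's implementation: the token shapes, the face-mask gating, and `cross_attention` itself are assumptions. The key point is that the loop adds one masked cross-attention term per audio-identity pair, so any number of identities is supported.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention, for illustration only.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def afca_update(video_tokens, audio_streams, face_masks):
    # H' = H + AFCA_1 + AFCA_2 + ... + AFCA_n: one cross-attention pass
    # per audio-identity pair, each gated by that identity's face mask so
    # the update stays local to that speaker's region (no cross-talk).
    # The loop runs over however many identities are supplied.
    h = video_tokens.copy()
    for audio_tokens, mask in zip(audio_streams, face_masks):
        h = h + mask[:, None] * cross_attention(video_tokens, audio_tokens, audio_tokens)
    return h
```

Because each identity contributes an additive, face-masked term, adding a speaker requires no architectural change, only another (audio, mask) pair.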

AnyTalker method workflow (recovered from the paper's overview figure):
- Data preparation: ~1000 h of single-person video, concatenated multi-person data, and ~12 h of real multi-person video, quality-filtered with InsightFace, SyncNet, and optical flow.
- Core architecture: a 3D VAE yields video tokens and Wav2Vec2 yields audio tokens; Audio-Face Cross Attention (AFCA) performs extensible multi-stream processing over audio-identity pairs.
- Two-stage training: Stage 1 learns talking patterns from 50% single-person and 50% concatenated data; Stage 2 refines interactivity on the ~12 h of real multi-person data. Learning rate 2×10⁻⁵ → 5×10⁻⁶, AdamW optimizer, 32 NVIDIA H200 GPUs.
- AFCA details: audio and face features are concatenated and fed through multi-head attention with a face mask; identities are processed iteratively, H' = H + AFCA₁ + AFCA₂ + … + AFCAₙ, supporting an arbitrary number of identities. A temporal attention mask aligns 4 audio tokens to 1 video token.
- Evaluation framework: the InteractiveEyes benchmark measures interactivity via eye motion during listening, alongside traditional FID, FVD, Sync-C, and ID metrics. Over a listening segment S, Motion = (1/(|S|−1)) × Σⱼ |E_{i,j+1} − E_{i,j}|, and Interactivity = (L₂×Motion_{L₂} + L₃×Motion_{L₃}) / (L₂+L₃); this is the first quantitative metric for multi-person interactivity.
- Key results and contributions: scalability to an arbitrary number of identities via an extensible architecture; data efficiency (only 12 h of multi-person data versus 100-1000 h in prior work); SOTA interactivity scores with competitive lip sync; generalization across real humans, AIGC characters, and animals.
- Technical innovations: the identity-aware, iteratively applied AFCA layer; horizontal-concatenation data augmentation for synthetic multi-person scenes; face masking for spatial localization that prevents cross-talk; the eye-based interactivity metric for listening behavior; and two-stage pattern-then-interaction learning with progressive refinement.
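The eye-motion interactivity metric lends itself to a short numerical sketch. How keypoints are aggregated within a frame and how listening segments are detected are assumptions here; the structure (mean frame-to-frame displacement per segment, length-weighted across segments) follows the formulas in the figure.

```python
import numpy as np

def eye_motion(eye_kpts):
    # eye_kpts: (frames, keypoints, 2) eye keypoints over one listening
    # segment S. Motion = 1/(|S| - 1) * sum_j |E_{j+1} - E_j|, i.e. the
    # mean frame-to-frame displacement (averaged over keypoints here).
    if len(eye_kpts) < 2:
        return 0.0
    diffs = np.diff(eye_kpts, axis=0)
    return float(np.linalg.norm(diffs, axis=-1).mean())

def interactivity(listening_segments):
    # Length-weighted average over all listeners' listening segments,
    # generalizing Interactivity = (L2*Motion_L2 + L3*Motion_L3)/(L2 + L3)
    # from the two-listener case in the figure.
    total = sum(len(s) for s in listening_segments)
    return sum(len(s) * eye_motion(s) for s in listening_segments) / total
```

A character who sits perfectly still while listening scores zero, which is exactly the failure mode the metric is designed to expose.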
Q1. What is the main innovation in AnyTalker's training approach that reduces data requirements?
- Using only real multi-person videos for training
- Horizontally concatenating single-person videos to simulate multi-person scenarios
- Using AI-generated synthetic videos for training

Q2. How does AnyTalker evaluate the interactivity between characters in generated videos?
- By measuring the audio synchronization quality
- By tracking changes in facial expressions
- By measuring eye keypoint motion during listening periods

Q3. What unique capability does the Audio-Face Cross Attention (AFCA) architecture provide?
- It can only handle two speakers at a time
- It can scale to handle an arbitrary number of identities
- It works only with pre-recorded voices

Paper 2

Vision Bridge Transformer at Scale

Published: 2025-11-28

Link: http://arxiv.org/pdf/2511.23199

1. 📘 Topic and Domain: Vision Bridge Transformer (ViBT) for large-scale vision translation tasks in computer vision, focusing on conditional image and video generation.
2. 💡 Previous Research and New Ideas: Based on Brownian Bridge Models and probability path modeling, proposing the first large-scale (20B parameters) implementation of Bridge Models with a new stabilized velocity-matching objective.
3. ❓ Problem: Addressing the inefficiency and unnatural modeling of traditional noise-to-vision approaches in conditional generation tasks by developing a more direct and efficient data-to-data translation paradigm.
4. 🛠️ Methods: Implements a transformer-based architecture with variance-stabilized velocity matching objective and variance-corrected sampling strategy, trained on paired source-target data in latent space.
5. 📊 Results and Evaluation: Achieved competitive results with traditional conditional diffusion methods while being more efficient, demonstrated strong performance across various tasks including image editing, video stylization, and depth-to-video synthesis, evaluated using multiple metrics like NIQE, MUSIQ, and CLIP Score.

ViBT method workflow (recovered from the paper's overview figure):
- Data and encoding: paired samples (x₀, x₁) drawn from the joint source/target distribution are encoded into latent space with a VAE.
- Brownian bridge: the intermediate state is x_t = (1−t)·x₀ + t·x₁ + √(t(1−t))·ε.
- Training: sample t ~ U(0, 1), compute the velocity target, and update the parameters θ under the stabilized velocity-matching objective, normalized by α(x₀, x₁, t).
- Inference: initialize at the source x₀ and integrate with Euler-Maruyama discretization plus variance-corrected sampling.
- Key innovation, variance-stabilized velocity matching: α²(x₀, x₁, t) = 1 + t·D / [(1−t)·||x₁−x₀||²], which prevents divergence at t→1 and balances the loss across timesteps.
- Tasks: instruction-based image editing (20B parameters, LoRA fine-tuning); video stylization (style transfer with motion preservation and temporal coherence); video translation such as depth-to-video (1.3B parameters, full fine-tuning); plus video colorization, frame interpolation, and image coloring.
- Architecture and training details: the image model builds on a Qwen-20B base and the video model on a Wan 2.1-1.3B base; Prodigy optimizer (lr = 1), 20k training iterations, and a reported 2-4× speedup.
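The bridge construction and stabilization factor can be sketched numerically. The bridge state and α² follow the formulas in the figure; the velocity target (x₁ − x_t)/(1 − t) used below is the standard Brownian-bridge drift and is an assumption — the paper's exact parameterization may differ.

```python
import numpy as np

def bridge_state(x0, x1, t, rng):
    # Brownian-bridge intermediate state between source x0 and target x1:
    # x_t = (1 - t) x0 + t x1 + sqrt(t (1 - t)) eps
    eps = rng.standard_normal(x0.shape)
    return (1 - t) * x0 + t * x1 + np.sqrt(t * (1 - t)) * eps

def alpha_squared(x0, x1, t):
    # Variance-stabilizing factor from the figure:
    # alpha^2 = 1 + t D / ((1 - t) ||x1 - x0||^2), D = dimensionality.
    # Keeps the loss finite as t -> 1 and balances it across timesteps.
    return 1 + t * x0.size / ((1 - t) * np.sum((x1 - x0) ** 2))

def velocity_matching_loss(pred_v, x1, x_t, x0, t):
    # Bridge drift toward the target (assumed parameterization):
    target = (x1 - x_t) / (1 - t)
    # Stabilized objective: squared error scaled down by alpha^2.
    return np.mean((pred_v - target) ** 2) / alpha_squared(x0, x1, t)
```

Note how the raw target (x₁ − x_t)/(1 − t) blows up as t → 1 while α² grows at the same rate, which is what keeps the per-timestep loss contributions balanced.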
Q1. What is the main advantage of ViBT's data-to-data translation paradigm compared to traditional noise-to-vision approaches?
- It requires less computational resources and memory
- It provides a more natural and direct path between source and target domains
- It generates higher quality images with better resolution

Q2. Why did the authors introduce the stabilized velocity-matching objective in ViBT?
- To increase the model's parameter count
- To reduce training time and memory usage
- To address numerical instability and balance loss contributions across timesteps

Q3. What is the scale of parameters used in ViBT for different tasks?
- 20B for both image and video tasks
- 20B for image tasks and 1.3B for video tasks
- 1.3B for image tasks and 20B for video tasks

Paper 3

REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

Published: 2025-11-27

Link: http://arxiv.org/pdf/2511.22625

1. 📘 Topic and Domain: The paper introduces REASONEDIT, an image editing model that enhances editing capabilities through reasoning mechanisms in computer vision and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on multimodal large language models (MLLMs) coupled with diffusion decoders for image editing, this paper proposes thinking and reflection mechanisms that enhance instruction understanding and editing accuracy.
3. ❓ Problem: The paper addresses the limitation of current image editing models that struggle with complex or abstract instructions due to frozen MLLM encoders during training.
4. 🛠️ Methods: The authors implement a multi-stage training strategy combining an MLLM as the Reasoner and a DiT as the Generator, using thinking pairs and reflection triples datasets to train the model's reasoning capabilities.
5. 📊 Results and Evaluation: The model achieved significant performance gains over baseline models, with ReasonEdit-S improving ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%), while ReasonEdit-Q showed improvements of ImgEdit (+2.8%), GEdit (+3.4%), and Kris (+6.1%).

ReasonEdit workflow (recovered from the paper's overview figure):
- Data construction: 200k thinking pairs (abstract → concrete instructions) and 180k reflection triples (input → generated → target images, in a 3:1:1 ratio).
- Multi-stage training: (1) reasoning learning, (2) edit learning, (3) unified tuning.
- Model architecture: an MLLM serves as the Reasoner and a DiT as the Generator.
- Inference pipeline (Thinking → Editing → Reflection loop): given a reference image and an abstract instruction, the MLLM converts the instruction into concrete commands; the DiT generates the edited image from them; a multi-round reflection stage assesses the result (assess → conclude → refine/success). On #Success the edited image is output; otherwise a refined editing instruction is generated and the loop repeats.
- Thinking mechanism: 200k curated abstract-to-concrete pairs leverage the MLLM's world knowledge to handle complex or ambiguous instructions, e.g. "symptoms of potassium" → "Make leaves yellow, desiccate tips".
- Reflection mechanism: a multi-round single-image pipeline (target describe → assess → conclude) outputs #Success, #Reflection, or #Failed, giving iterative self-correction with VIEScore-based quality assessment.
- Performance gains: ReasonEdit-S improves ImgEdit by +4.3%, GEdit by +4.7%, and Kris by +8.2%; state-of-the-art among open-source methods.
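The inference pipeline reduces to a simple control loop. In this sketch, `reasoner` and `generator` stand in for the MLLM Reasoner and DiT Generator; their method names (`think`, `edit`, `reflect`, `refine`) and the round limit are hypothetical, but the thinking → editing → reflection flow follows the paper's description.

```python
def reason_edit(image, abstract_instruction, reasoner, generator, max_rounds=3):
    # Thinking: the MLLM rewrites the abstract request as concrete commands.
    instruction = reasoner.think(image, abstract_instruction)
    edited = image
    for _ in range(max_rounds):
        # Editing: the DiT generates an edited image from the commands.
        edited = generator.edit(image, instruction)
        # Reflection: assess the result against the intended edit.
        verdict = reasoner.reflect(image, edited, abstract_instruction)
        if verdict == "#Success":
            return edited
        # Refinement: produce a revised instruction and try again.
        instruction = reasoner.refine(image, edited, abstract_instruction)
    return edited
```

The point of the loop is that the generator never sees the raw abstract instruction — only concrete commands that the reasoner keeps revising until reflection reports success or the round budget runs out.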
Q1. What is the main innovation that REASONEDIT introduces to improve image editing capabilities?
- A new type of diffusion decoder architecture
- Thinking and reflection mechanisms for reasoning enhancement
- A larger dataset of image-text pairs

Q2. In the multi-stage training strategy of REASONEDIT, what component remains frozen during the Edit Learning Stage?
- The DiT Generator
- The MLLM Reasoner
- Both components

Q3. What is the key limitation of current image editing models that REASONEDIT addresses?
- Poor image quality in generated outputs
- Slow processing speed during editing
- Difficulty handling complex or abstract instructions due to frozen MLLM encoders
Difficulty handling complex or abstract instructions due to frozen MLLM encoders