2025-09-23 Papers


Paper 1

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Published: 2025-09-22

Link: http://arxiv.org/pdf/2509.17627

1. 📘 Topic and Domain: Video editing and generation using AI, specifically focusing on mask-free video insertion of reference subjects into existing videos using diffusion transformer models.
2. 💡 Previous Research and New Ideas: Based on previous video diffusion models and video insertion techniques that relied on masks and complex control signals, this paper introduces a novel mask-free approach with multi-stage training and specialized feature injection mechanisms.
3. ❓ Problem: The paper addresses three key challenges in mask-free video insertion: data scarcity for training, maintaining balance between subject and scene elements, and achieving natural harmonization of inserted content.
4. 🛠️ Methods: The paper introduces InsertPipe for data generation, OmniInsert framework with Condition-Specific Feature Injection, Progressive Training strategy, Subject-Focused Loss, and Context-Aware Rephraser for inference.
5. 📊 Results and Evaluation: The method outperformed commercial solutions in both quantitative metrics and user studies on their new InsertBench dataset, showing superior subject consistency, text-video alignment, and overall video quality.


OmniInsert workflow (from the paper's overview figure):

- InsertPipe data pipeline: RealCapture, SynthGen, and SimInteract pipes; cross-pair videos, detection and tracking, video erasing, VLM filtering.
- OmniInsert framework: Condition-Specific Feature Injection (CFI) on a diffusion transformer with LoRA integration; video injected via channel concatenation, subject via temporal concatenation, text via prompt embedding; multi-condition guidance.
- Progressive Training strategy: Phase 1 subject-to-video, Phase 2 MVI pretraining, Phase 3 refinement, Phase 4 IPO; trained with Subject-Focused Loss (SL), Insertive Preference Optimization (IPO), and a flow matching loss.
- Inference pipeline: Context-Aware Rephraser (CAR) with scene-aware prompt enhancement and VLM-based context understanding; joint classifier-free guidance with dynamic guidance scaling for multi-condition balance.
- Evaluation and benchmark: InsertBench (120 videos plus subjects); metrics cover subject consistency (CLIP-I, DINO-I, FaceSim), text-video alignment (ViCLIP-T), and video quality (dynamics, aesthetics, consistency); baselines Pika-Pro and Kling; superior quantitative and qualitative results.
- Key innovations: mask-free video insertion, unified single/multi-subject framework, subject-scene equilibrium, insertion harmonization, commercial-grade quality.
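The joint classifier-free guidance step above combines denoiser predictions under several conditions (unconditional, subject, subject+text). The paper's exact weighting scheme is not reproduced here; this is a minimal sketch of a common cascaded multi-condition CFG formulation, with `joint_cfg` and its weight names as illustrative assumptions:

```python
import numpy as np

def joint_cfg(eps_uncond, eps_subject, eps_full, w_subject=2.0, w_text=5.0):
    """Cascaded multi-condition classifier-free guidance (sketch).

    eps_uncond  -- denoiser output with all conditions dropped
    eps_subject -- output conditioned on the reference subject only
    eps_full    -- output conditioned on subject + text prompt
    Each guidance weight scales the direction contributed by one condition.
    """
    return (eps_uncond
            + w_subject * (eps_subject - eps_uncond)
            + w_text * (eps_full - eps_subject))

# Toy 1-D example: each "prediction" is a scalar latent update.
guided = joint_cfg(np.array(0.0), np.array(1.0), np.array(2.0))
```

Dynamic guidance scaling, as mentioned in the inference pipeline, would correspond to varying `w_subject` and `w_text` over denoising steps.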
Q1. What is the main innovation of OmniInsert compared to previous video insertion methods?
- It uses a completely new type of neural network architecture
- It eliminates the need for masks while maintaining high-quality insertion
- It can only work with pre-defined subject categories

Q2. How long does it take for OmniInsert to generate a 5-second 480P video using 8 NVIDIA A100 GPUs?
- About 30 seconds
- About 90 seconds
- About 180 seconds

Q3. Which component of OmniInsert helps achieve natural integration of subjects into scenes during inference?
- Progressive Training (PT) strategy
- Subject-Focused Loss (SL)
- Context-Aware Rephraser (CAR)

Paper 2

Qwen3-Omni Technical Report

Published: 2025-09-22

Link: http://arxiv.org/pdf/2509.17765

1. 📘 Topic and Domain: A technical report introducing Qwen3-Omni, a multimodal large language model capable of processing and generating text, image, audio, and video content.
2. 💡 Previous Research and New Ideas: Based on previous Qwen models and the Thinker-Talker architecture from Qwen2.5-Omni, introducing new ideas including MoE architecture, Audio Transformer encoder, multi-codebook representation, and enhanced streaming capabilities.
3. ❓ Problem: Addressing the challenge of developing a single multimodal model that can maintain state-of-the-art performance across all modalities without degradation while enabling real-time interaction.
4. 🛠️ Methods: Implements a Thinker-Talker Mixture-of-Experts architecture with five key upgrades: MoE design, AuT encoder, multi-codebook representation, multi-track codec modeling, and reduced audio code rates for streaming.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on 32 of 36 audio/audiovisual benchmarks, matches single-modal counterparts on text and vision tasks, supports 119 written languages plus 19 spoken input and 10 spoken output languages, and reaches a first-packet latency of 234 ms.


Qwen3-Omni technical workflow (from the report's overview figure):

- Architecture design: Thinker-Talker MoE, AuT audio encoder, multi-codebook scheme, TM-RoPE embedding, ConvNet Code2Wav.
- AuT training: 20M hours of audio (80% Chinese/English ASR, 10% multilingual ASR, 10% audio understanding); 12.5 Hz token rate.
- Pretraining (3 stages): S1 encoder alignment (LLM locked, vision/audio adapters trained); S2 general stage (2T tokens of multimodal data, all parameters unfrozen); S3 long context (32K max length, long audio/video, enhanced understanding).
- Post-training: Thinker via SFT, strong-to-weak distillation, and GSPO on multimodal tasks; Talker via speech mapping, CPT with long context, multilingual DPO, and speaker fine-tuning; Captioner fine-tuned on a low-hallucination audio-description dataset.
- Streaming optimizations: chunked prefilling with MoE, left-context multi-codebook generation, lightweight MTP module; 234 ms end-to-end latency.
- Comprehensive evaluation: Text→Text (11 benchmarks); Audio→Text (ASR, music, reasoning); Vision→Text (VQA, math, OCR); X→Speech (TTS, multilingual).
- Final models: Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, Qwen3-Omni-30B-A3B-Captioner, and Qwen3-Omni-Flash variants; 119 text languages, 19 speech-input and 10 speech-output languages, up to 40-minute audio.
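The figures above (12.5 Hz codec token rate, 40-minute audio support) imply a concrete token budget. A back-of-the-envelope check, with function and constant names chosen here for illustration:

```python
# Token-budget arithmetic from figures quoted in the report:
# the audio codec emits tokens at 12.5 Hz, and inputs up to
# 40 minutes of audio are supported.
TOKEN_RATE_HZ = 12.5   # audio tokens per second of audio
MAX_AUDIO_MIN = 40     # maximum supported audio length, minutes

def audio_tokens(seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Number of codec tokens representing `seconds` of audio."""
    return int(seconds * rate_hz)

# A full 40-minute clip costs 40 * 60 * 12.5 = 30,000 audio tokens,
# which fits comfortably within the 32K long-context window.
max_tokens = audio_tokens(MAX_AUDIO_MIN * 60)
```

The low token rate is one reason long audio fits in context; halving the rate would halve this budget.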
Q1. What is the key architectural innovation that enables Qwen3-Omni to achieve low latency and high throughput?
- Using a simple transformer architecture
- Implementing a Thinker-Talker Mixture-of-Experts (MoE) design
- Relying on traditional CNN networks

Q2. What is the theoretical end-to-end first-packet latency achieved by Qwen3-Omni in cold-start settings?
- 534 milliseconds
- 334 milliseconds
- 234 milliseconds

Q3. How many benchmarks did Qwen3-Omni achieve state-of-the-art performance on out of the 36 audio and audio-visual benchmarks tested?
- 22 benchmarks
- 32 benchmarks
- 36 benchmarks

Paper 3

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Published: 2025-09-22

Link: http://arxiv.org/pdf/2509.18056

1. 📘 Topic and Domain: The paper focuses on improving temporal video understanding in multimodal large language models (MLLMs) through a reinforcement learning framework called TempSamp-R1.
2. 💡 Previous Research and New Ideas: Based on existing Group Relative Policy Optimization (GRPO) methods, the paper proposes integrating off-policy supervision with on-policy sampling and introduces non-linear soft advantage estimation for more stable training.
3. ❓ Problem: The paper addresses the limitations of current reinforcement learning methods in temporal video grounding tasks, where large temporal search spaces make it difficult to identify accurate temporal solutions.
4. 🛠️ Methods: The authors implement TempSamp-R1 which combines on-policy generation with off-policy guidance from ground-truth annotations, uses soft advantage computation, and employs a hybrid Chain-of-Thought training paradigm.
5. 📊 Results and Evaluation: TempSamp-R1 outperformed baselines on multiple benchmarks, achieving improvements on Charades-STA (R1@0.7: 52.9%), ActivityNet Captions (R1@0.5: 56.0%), and QVHighlights (mAP: 30.0%), while showing strong few-shot generalization capabilities.
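The R1@0.5 and R1@0.7 metrics in the results count predictions whose temporal IoU with the ground-truth span exceeds the threshold, and the paper's reward also includes an IoU term. A minimal temporal-IoU helper (function name is illustrative):

```python
def temporal_iou(pred, gt):
    """IoU between two time spans given as (start, end) in seconds.

    Intersection is the overlap of the two spans; union is the total
    covered duration. Returns 0.0 for disjoint or degenerate spans.
    """
    s1, e1 = pred
    s2, e2 = gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction shifted 5 s from a 10 s ground-truth span.
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))   # overlap 5 s, union 15 s
```

A prediction counts toward R1@0.5 when `temporal_iou(pred, gt) >= 0.5`.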


TempSamp-R1 workflow (from the paper's overview figure):

- Input: video plus query (2 FPS, 2.8M pixels); policy model Qwen2.5-VL-7B.
- Mixed-policy sampling: G−1 on-policy solutions plus 1 off-policy solution drawn from ground-truth annotations.
- Reward computation: IoU reward (temporal), timestamp matching, and a format reward (CoT), giving R = {r₁, r₂, ..., r_G}.
- Soft advantage estimation via non-linear reward shaping:
  A_i = τ + α₁ · ln((r_i − τ) + 1) if r_i ≥ τ
  A_i = τ − (e^(α₂·(τ − r_i)) − 1) / (e^(α₂) − 1) if r_i < τ
  Alternative strategies considered: reward downscaling and advantage anchoring; non-linear shaping performs best.
- GRPO policy update: J_GRPO = min(π_θ(o|q)/π_θ_old(o|q) · A, clip(...) · A) − β · KL(π_θ ‖ π_ref).
- Hybrid CoT training: one unified model supporting both CoT and non-CoT modes; Phase 1 initializes with direct answer generation without reasoning, Phase 2 integrates CoT with format rewards for reasoning steps.
- Performance: Charades-STA R1@0.7 52.9% (+2.7%), ActivityNet Captions R1@0.5 56.0% (+5.3%), QVHighlights mAP 30.0% (+3.0%); tasks span temporal grounding (Charades-STA, ActivityNet) and highlight detection (QVHighlights).
- Key innovations: mixed-policy sampling (on- plus off-policy), non-linear soft advantage estimation, hybrid CoT training paradigm, stable training with reduced variance, superior few-shot generalization.
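The non-linear reward-shaping rule above translates directly into code. This sketch implements the two-branch formula as stated; the default values for τ, α₁, and α₂ are illustrative assumptions (the paper's tuned values may differ), and GRPO's group normalization of advantages is omitted:

```python
import math

def soft_advantage(r, tau=0.5, a1=1.0, a2=1.0):
    """Non-linear soft advantage shaping from TempSamp-R1.

    Rewards above the threshold tau grow logarithmically (compressing
    large rewards); rewards below tau decay exponentially toward a
    bounded penalty, reducing gradient variance from poor samples.
    tau/a1/a2 defaults here are illustrative, not the paper's values.
    """
    if r >= tau:
        return tau + a1 * math.log((r - tau) + 1.0)
    return tau - (math.exp(a2 * (tau - r)) - 1.0) / (math.exp(a2) - 1.0)

# Shape a group of mixed on-/off-policy rewards.
rewards = [0.0, 0.4, 0.5, 0.9, 1.0]
advantages = [soft_advantage(r) for r in rewards]
```

Both branches meet at `tau` (the log term and the exponential term are each zero when r = τ), so the shaping is continuous across the threshold.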
Q1. What is the main limitation of existing GRPO-based methods that TempSamp-R1 aims to address?
- High computational costs
- Ineffective exploration in large temporal search spaces
- Inability to process long videos

Q2. How does TempSamp-R1 incorporate off-policy guidance into its training process?
- By using predictions from other models
- By leveraging ground-truth annotations as external solutions
- By randomly generating temporal segments

Q3. What unique feature of TempSamp-R1's training paradigm allows it to handle queries with varying complexity?
- It uses multiple separate models for different query types
- It automatically categorizes queries by difficulty
- It supports both CoT and non-CoT inference modes in a single unified model