2025-09-23 Papers


Paper 1

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Published: 2025-09-22

Link: http://arxiv.org/pdf/2509.17627

1. 📘 Topic and Domain: Video editing and generation using AI, specifically focusing on mask-free video insertion of reference subjects into existing videos using diffusion transformer models.
2. 💡 Previous Research and New Ideas: Based on previous video diffusion models and video insertion techniques that relied on masks and complex control signals, this paper introduces a novel mask-free approach with multi-stage training and specialized feature injection mechanisms.
3. ❓ Problem: The paper addresses three key challenges in mask-free video insertion: data scarcity for training, maintaining balance between subject and scene elements, and achieving natural harmonization of inserted content.
4. 🛠️ Methods: The paper introduces InsertPipe for data generation, OmniInsert framework with Condition-Specific Feature Injection, Progressive Training strategy, Subject-Focused Loss, and Context-Aware Rephraser for inference.
5. 📊 Results and Evaluation: The method outperformed commercial solutions in both quantitative metrics and user studies on their new InsertBench dataset, showing superior subject consistency, text-video alignment, and overall video quality.


OmniInsert workflow (from the paper's overview figure):

- InsertPipe data pipeline: RealCapture, SynthGen, and SimInteract pipes; cross-pair videos, detection and tracking, video erasing, VLM filtering.
- OmniInsert framework: Condition-Specific Feature Injection (CFI) on a diffusion transformer with LoRA integration; video injected via channel concatenation, subject via temporal concatenation, text via prompt embedding; multi-condition guidance.
- Progressive Training strategy: Phase 1 subject-to-video, Phase 2 MVI pretraining, Phase 3 refinement, Phase 4 IPO; trained with Subject-Focused Loss (SL), Insertive Preference Optimization (IPO), and a flow matching loss.
- Inference pipeline: Context-Aware Rephraser (CAR) with scene-aware prompt enhancement and VLM-based context understanding; joint classifier-free guidance with dynamic guidance scaling for multi-condition balance.
- Evaluation and benchmark: InsertBench (120 videos plus subjects); metrics cover subject consistency (CLIP-I, DINO-I, FaceSim), text-video alignment (ViCLIP-T), and video quality (dynamics, aesthetics, consistency); baselines Pika-Pro and Kling; superior quantitative and qualitative results.
- Key innovations: mask-free video insertion, unified single/multi-subject framework, subject-scene equilibrium, insertion harmonization, commercial-grade quality.
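The joint classifier-free guidance step above combines denoiser predictions under several conditions (unconditional, subject, subject+text). The paper's exact weighting scheme is not reproduced here; this is a minimal sketch of a common cascaded multi-condition CFG formulation, with `joint_cfg` and its weight names as illustrative assumptions:

```python
import numpy as np

def joint_cfg(eps_uncond, eps_subject, eps_full, w_subject=2.0, w_text=5.0):
    """Cascaded multi-condition classifier-free guidance (sketch).

    eps_uncond  -- denoiser output with all conditions dropped
    eps_subject -- output conditioned on the reference subject only
    eps_full    -- output conditioned on subject + text prompt
    Each guidance weight scales the direction contributed by one condition.
    """
    return (eps_uncond
            + w_subject * (eps_subject - eps_uncond)
            + w_text * (eps_full - eps_subject))

# Toy 1-D example: each "prediction" is a scalar latent update.
guided = joint_cfg(np.array(0.0), np.array(1.0), np.array(2.0))
```

Dynamic guidance scaling, as mentioned in the inference pipeline, would correspond to varying `w_subject` and `w_text` over denoising steps.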
Q1. What is the main innovation of OmniInsert compared to previous video insertion methods?
- It uses a completely new type of neural network architecture
- It eliminates the need for masks while maintaining high-quality insertion
- It can only work with pre-defined subject categories

Q2. How long does it take for OmniInsert to generate a 5-second 480P video using 8 NVIDIA A100 GPUs?
- About 30 seconds
- About 90 seconds
- About 180 seconds

Q3. Which component of OmniInsert helps achieve natural integration of subjects into scenes during inference?
- Progressive Training (PT) strategy
- Subject-Focused Loss (SL)
- Context-Aware Rephraser (CAR)

Paper 2

Qwen3-Omni Technical Report

Published: 2025-09-22

Link: http://arxiv.org/pdf/2509.17765

1. 📘 Topic and Domain: A technical report introducing Qwen3-Omni, a multimodal large language model capable of processing and generating text, image, audio, and video content.
2. 💡 Previous Research and New Ideas: Based on previous Qwen models and the Thinker-Talker architecture from Qwen2.5-Omni, introducing new ideas including MoE architecture, Audio Transformer encoder, multi-codebook representation, and enhanced streaming capabilities.
3. ❓ Problem: Addressing the challenge of developing a single multimodal model that can maintain state-of-the-art performance across all modalities without degradation while enabling real-time interaction.
4. 🛠️ Methods: Implements a Thinker-Talker Mixture-of-Experts architecture with five key upgrades: MoE design, AuT encoder, multi-codebook representation, multi-track codec modeling, and reduced audio code rates for streaming.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on 32 of 36 audio/audiovisual benchmarks, matches single-modal counterparts on text and vision tasks, supports 119 written languages plus 19 spoken input and 10 spoken output languages, and reaches a first-packet latency of 234 ms.


Qwen3-Omni technical workflow (from the report's overview figure):

- Architecture design: Thinker-Talker MoE, AuT audio encoder, multi-codebook scheme, TM-RoPE embedding, ConvNet Code2Wav.
- AuT training: 20M hours of audio (80% Chinese/English ASR, 10% multilingual ASR, 10% audio understanding); 12.5 Hz token rate.
- Pretraining (3 stages): S1 encoder alignment (LLM locked, vision/audio adapters trained); S2 general stage (2T tokens of multimodal data, all parameters unfrozen); S3 long context (32K max length, long audio/video, enhanced understanding).
- Post-training: Thinker via SFT, strong-to-weak distillation, and GSPO on multimodal tasks; Talker via speech mapping, CPT with long context, multilingual DPO, and speaker fine-tuning; Captioner fine-tuned on a low-hallucination audio-description dataset.
- Streaming optimizations: chunked prefilling with MoE, left-context multi-codebook generation, lightweight MTP module; 234 ms end-to-end latency.
- Comprehensive evaluation: Text→Text (11 benchmarks); Audio→Text (ASR, music, reasoning); Vision→Text (VQA, math, OCR); X→Speech (TTS, multilingual).
- Final models: Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, Qwen3-Omni-30B-A3B-Captioner, and Qwen3-Omni-Flash variants; 119 text languages, 19 speech-input and 10 speech-output languages, up to 40-minute audio.
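The figures above (12.5 Hz codec token rate, 40-minute audio support) imply a concrete token budget. A back-of-the-envelope check, with function and constant names chosen here for illustration:

```python
# Token-budget arithmetic from figures quoted in the report:
# the audio codec emits tokens at 12.5 Hz, and inputs up to
# 40 minutes of audio are supported.
TOKEN_RATE_HZ = 12.5   # audio tokens per second of audio
MAX_AUDIO_MIN = 40     # maximum supported audio length, minutes

def audio_tokens(seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Number of codec tokens representing `seconds` of audio."""
    return int(seconds * rate_hz)

# A full 40-minute clip costs 40 * 60 * 12.5 = 30,000 audio tokens,
# which fits comfortably within the 32K long-context window.
max_tokens = audio_tokens(MAX_AUDIO_MIN * 60)
```

The low token rate is one reason long audio fits in context; halving the rate would halve this budget.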
Q1. What is the key architectural innovation that enables Qwen3-Omni to achieve low latency and high throughput?
- Using a simple transformer architecture
- Implementing a Thinker-Talker Mixture-of-Experts (MoE) design
- Relying on traditional CNN networks

Q2. What is the theoretical end-to-end first-packet latency achieved by Qwen3-Omni in cold-start settings?
- 534 milliseconds
- 334 milliseconds
- 234 milliseconds

Q3. How many benchmarks did Qwen3-Omni achieve state-of-the-art performance on out of the 36 audio and audio-visual benchmarks tested?
- 22 benchmarks
- 32 benchmarks
- 36 benchmarks

Paper 3

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Published: 2025-09-22

Link: http://arxiv.org/pdf/2509.18056

1. 📘 Topic and Domain: The paper focuses on improving temporal video understanding in multimodal large language models (MLLMs) through a reinforcement learning framework called TempSamp-R1.
2. 💡 Previous Research and New Ideas: Based on existing Group Relative Policy Optimization (GRPO) methods, the paper proposes integrating off-policy supervision with on-policy sampling and introduces non-linear soft advantage estimation for more stable training.
3. ❓ Problem: The paper addresses the limitations of current reinforcement learning methods in temporal video grounding tasks, where large temporal search spaces make it difficult to identify accurate temporal solutions.
4. 🛠️ Methods: The authors implement TempSamp-R1 which combines on-policy generation with off-policy guidance from ground-truth annotations, uses soft advantage computation, and employs a hybrid Chain-of-Thought training paradigm.
5. 📊 Results and Evaluation: TempSamp-R1 outperformed baselines on multiple benchmarks, achieving improvements on Charades-STA (R1@0.7: 52.9%), ActivityNet Captions (R1@0.5: 56.0%), and QVHighlights (mAP: 30.0%), while showing strong few-shot generalization capabilities.
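The R1@0.5 and R1@0.7 metrics in the results count predictions whose temporal IoU with the ground-truth span exceeds the threshold, and the paper's reward also includes an IoU term. A minimal temporal-IoU helper (function name is illustrative):

```python
def temporal_iou(pred, gt):
    """IoU between two time spans given as (start, end) in seconds.

    Intersection is the overlap of the two spans; union is the total
    covered duration. Returns 0.0 for disjoint or degenerate spans.
    """
    s1, e1 = pred
    s2, e2 = gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction shifted 5 s from a 10 s ground-truth span.
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))   # overlap 5 s, union 15 s
```

A prediction counts toward R1@0.5 when `temporal_iou(pred, gt) >= 0.5`.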


TempSamp-R1 workflow (from the paper's overview figure):

- Input: video plus query (2 FPS, 2.8M pixels); policy model Qwen2.5-VL-7B.
- Mixed-policy sampling: G−1 on-policy solutions plus 1 off-policy solution drawn from ground-truth annotations.
- Reward computation: IoU reward (temporal), timestamp matching, and a format reward (CoT), giving R = {r₁, r₂, ..., r_G}.
- Soft advantage estimation via non-linear reward shaping:
  A_i = τ + α₁ · ln((r_i − τ) + 1) if r_i ≥ τ
  A_i = τ − (e^(α₂·(τ − r_i)) − 1) / (e^(α₂) − 1) if r_i < τ
  Alternative strategies considered: reward downscaling and advantage anchoring; non-linear shaping performs best.
- GRPO policy update: J_GRPO = min(π_θ(o|q)/π_θ_old(o|q) · A, clip(...) · A) − β · KL(π_θ ‖ π_ref).
- Hybrid CoT training: one unified model supporting both CoT and non-CoT modes; Phase 1 initializes with direct answer generation without reasoning, Phase 2 integrates CoT with format rewards for reasoning steps.
- Performance: Charades-STA R1@0.7 52.9% (+2.7%), ActivityNet Captions R1@0.5 56.0% (+5.3%), QVHighlights mAP 30.0% (+3.0%); tasks span temporal grounding (Charades-STA, ActivityNet) and highlight detection (QVHighlights).
- Key innovations: mixed-policy sampling (on- plus off-policy), non-linear soft advantage estimation, hybrid CoT training paradigm, stable training with reduced variance, superior few-shot generalization.
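The non-linear reward-shaping rule above translates directly into code. This sketch implements the two-branch formula as stated; the default values for τ, α₁, and α₂ are illustrative assumptions (the paper's tuned values may differ), and GRPO's group normalization of advantages is omitted:

```python
import math

def soft_advantage(r, tau=0.5, a1=1.0, a2=1.0):
    """Non-linear soft advantage shaping from TempSamp-R1.

    Rewards above the threshold tau grow logarithmically (compressing
    large rewards); rewards below tau decay exponentially toward a
    bounded penalty, reducing gradient variance from poor samples.
    tau/a1/a2 defaults here are illustrative, not the paper's values.
    """
    if r >= tau:
        return tau + a1 * math.log((r - tau) + 1.0)
    return tau - (math.exp(a2 * (tau - r)) - 1.0) / (math.exp(a2) - 1.0)

# Shape a group of mixed on-/off-policy rewards.
rewards = [0.0, 0.4, 0.5, 0.9, 1.0]
advantages = [soft_advantage(r) for r in rewards]
```

Both branches meet at `tau` (the log term and the exponential term are each zero when r = τ), so the shaping is continuous across the threshold.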
Q1. What is the main limitation of existing GRPO-based methods that TempSamp-R1 aims to address?
- High computational costs
- Ineffective exploration in large temporal search spaces
- Inability to process long videos

Q2. How does TempSamp-R1 incorporate off-policy guidance into its training process?
- By using predictions from other models
- By leveraging ground-truth annotations as external solutions
- By randomly generating temporal segments

Q3. What unique feature of TempSamp-R1's training paradigm allows it to handle queries with varying complexity?
- It uses multiple separate models for different query types
- It automatically categorizes queries by difficulty
- It supports both CoT and non-CoT inference modes in a single unified model