2025-10-10 Papers


Paper 1

VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Published: 2025-10-09

Link: http://arxiv.org/pdf/2510.08555

1. 📘 Topic and Domain: Video generation and completion, specifically focusing on unified video synthesis from arbitrary spatiotemporal patches using diffusion models.
2. 💡 Previous Research and New Ideas: Building on prior work in controllable video generation and In-Context Conditioning (ICC), the paper introduces a framework that unifies diverse video generation tasks under a single paradigm.
3. ❓ Problem: Addresses the challenge of generating coherent videos from arbitrary patches placed at any spatial location and timestamp, while resolving temporal ambiguity in causal VAEs.
4. 🛠️ Methods: Employs a hybrid conditioning strategy combining Spatial Zero-Padding and Temporal RoPE Interpolation within an In-Context Conditioning framework, requiring zero new parameters.
5. 📊 Results and Evaluation: Outperforms existing conditioning paradigms on VideoCanvasBench, with superior visual quality, temporal coherence, and dynamic degree, and significantly higher user-preference scores (60-70% vs. 25-30% for baselines).
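The key fix for the causal-VAE ambiguity is Temporal RoPE Interpolation: a condition frame's pixel timestamp is mapped to a fractional latent position so frames that would collapse into the same latent slot stay distinguishable. A minimal sketch, assuming a temporal compression stride of 4 and a toy RoPE dimension; the function names and stride are illustrative, not the paper's:

```python
import numpy as np

def rope_angles(pos, dim=8, base=10000.0):
    """Standard RoPE rotation angles for a (possibly fractional) position."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return pos * freqs

def temporal_rope_position(frame_idx, stride=4):
    """Map a pixel-frame timestamp to a fractional latent position.
    A causal VAE with temporal stride 4 folds several frames into one
    latent slot; interpolating the position keeps timestamps distinct."""
    return frame_idx / stride

# Frames 5 and 6 land in the same latent slot under integer indexing...
p5 = temporal_rope_position(5)   # 1.25
p6 = temporal_rope_position(6)   # 1.5
# ...but get distinct rotary embeddings once positions are fractional.
assert not np.allclose(rope_angles(p5), rope_angles(p6))
```

Because only the RoPE position assignment changes, the trick adds no parameters and leaves the VAE frozen, consistent with the paper's claim.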


Methodology flow (from the paper's overview figure):

- Input conditions: P = {(p_i, m_i, t_i)}, patches at arbitrary spatio-temporal locations.
- Spatial conditioning (zero-padding): x_prep,i = m_i ⊙ p_i, with the remaining pixels filled with zeros.
- VAE encoding (temporal decoupling): z_cond,i = E(x_prep,i), each condition encoded independently.
- Core challenge: temporal ambiguity in the causal VAE, where multiple frames map to a single latent slot.
- Solution: Temporal RoPE Interpolation, pos_t(z_cond,i) = t_i / N.
- Sequence construction (in-context conditioning): z = Concat({z_cond,i}, z_source).
- Training: flow-matching loss computed on non-conditional regions only.
- Output: complete video with arbitrary spatio-temporal completion.
- Unified applications: any-timestamp I2V, any-timestamp P2V, video transition, inpainting, outpainting.
- Key innovation: zero new parameters and a frozen VAE.
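The spatial side of the hybrid strategy is simple to sketch: a conditioning patch is placed at its spatial location on an otherwise-zero canvas (x_prep = m ⊙ p in the paper's notation), encoded, and then concatenated with the source latents along the sequence axis. A toy 2-D numpy version, standing in for real video latents and the VAE encoder:

```python
import numpy as np

def spatial_zero_pad(patch, top, left, frame_hw):
    """Spatial zero-padding: place a conditioning patch at its location
    on an otherwise-zero canvas (x_prep = m ⊙ p in the paper's notation)."""
    H, W = frame_hw
    h, w = patch.shape
    canvas = np.zeros((H, W), dtype=patch.dtype)
    canvas[top:top + h, left:left + w] = patch
    return canvas

frame = spatial_zero_pad(np.ones((2, 3)), top=1, left=2, frame_hw=(4, 8))
assert frame.sum() == 6            # only the patch region is non-zero

# In-context conditioning then concatenates each (VAE-encoded) condition
# latent with the source latents along the sequence axis -- no new layers:
z_cond = frame.reshape(1, -1)      # stand-in for E(x_prep)
z_source = np.zeros((5, frame.size))
z = np.concatenate([z_cond, z_source], axis=0)
assert z.shape == (6, 32)
```

Concatenating in sequence rather than in channels is what lets a frozen backbone attend to conditions at any location without architectural changes.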
Q1
1. What is the main challenge addressed by the VideoCanvas framework regarding causal VAEs?
High computational cost of video generation
Temporal ambiguity when mapping multiple frames to a single latent representation
Limited storage capacity for video data
Q2
2. Which innovative combination does the paper's hybrid conditioning strategy use?
Channel Concatenation and Latent Replacement
Cross-Attention and Channel Injection
Spatial Zero-Padding and Temporal RoPE Interpolation
Q3
3. What unique advantage does VideoCanvas offer compared to previous video generation approaches?
It requires extensive model retraining for each new task
It can only handle first-frame video generation
It unifies multiple video tasks in one framework with zero new parameters

Paper 2

UniVideo: Unified Understanding, Generation, and Editing for Videos

Published: 2025-10-09

Link: http://arxiv.org/pdf/2510.08377

1. 📘 Topic and Domain: A unified AI framework called UniVideo for video understanding, generation, and editing that combines multimodal capabilities in a single system.
2. 💡 Previous Research and New Ideas: Building on unified text-image models and task-specific video models, the paper proposes a dual-stream architecture that pairs a Multimodal Large Language Model (MLLM) for understanding with a Multimodal DiT (MMDiT) for generation.
3. ❓ Problem: Addresses the limitation of current video AI models being restricted to single tasks or modalities, lacking unified capabilities for understanding complex instructions and performing diverse video tasks.
4. 🛠️ Methods: Uses a two-stream architecture with frozen MLLM for instruction understanding and MMDiT for video generation, trained across multiple tasks including text/image-to-video generation and video editing through a three-stage training process.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance across multiple video tasks, demonstrates zero-shot generalization to unseen tasks, and shows strong capabilities in visual prompt understanding and task composition, evaluated through both human assessment and automatic metrics.
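The bridge between the two streams is an MLP connector with 4x hidden expansion that projects the frozen MLLM's semantic tokens into the MMDiT's conditioning space. A minimal numpy sketch; the 4x expansion follows the paper's figure, but the dimensions, ReLU, and random weights here are illustrative, not the actual models':

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_connector(h, w1, w2):
    """Two-layer MLP with 4x hidden expansion bridging the MLLM's
    semantic tokens into the MMDiT conditioning space (a sketch:
    real connectors typically use GELU/SiLU rather than ReLU)."""
    return np.maximum(h @ w1, 0.0) @ w2

d_mllm, d_dit = 64, 96                                  # illustrative dims
w1 = rng.standard_normal((d_mllm, 4 * d_mllm)) * 0.02   # 4x expansion
w2 = rng.standard_normal((4 * d_mllm, d_dit)) * 0.02

semantic_tokens = rng.standard_normal((10, d_mllm))     # frozen MLLM output
cond = mlp_connector(semantic_tokens, w1, w2)
assert cond.shape == (10, d_dit)   # ready to condition the MMDiT stream
```

Keeping the MLLM frozen and training only this connector (plus, later, the MMDiT) is what makes Stage 1 of the training pipeline cheap.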


Training pipeline and architecture (from the paper's overview figure):

- Stage 1, connector alignment: train the MLP connector only; MLLM and MMDiT frozen; 40M T2I + 10M T2V samples; image-reconstruction task; 15K steps.
- Stage 2, fine-tuning: MLLM frozen; fine-tune connector and MMDiT; 10K high-quality T2I and T2V samples; 5K steps; EMA ratio 0.9999.
- Stage 3, multi-task training: MLLM frozen; train connector and MMDiT on all tasks unified with mixed task sampling; 15K steps.
- Dual-stream architecture: the understanding stream (MLLM, Qwen2.5-VL-7B) acts as a semantic encoder for multimodal instructions (text, image, video) and visual prompts; the generation stream (MMDiT, HunyuanVideo-13B, with a VAE encoder) handles fine-grained visual generation; an MLP connector with 4x expansion links the two for cross-stream consistency.
- Unified task framework: text-to-video generation with multimodal instructions; image-to-video generation with ID preservation; in-context generation (multi-ID video, reference images); mask-free video editing (swap/delete/add); style transfer with motion preservation; visual prompting (canvas drawing, annotation).
- Key capabilities and generalization: zero-shot transfer of image editing to video, free-form instructions, task composition (multiple operations from a single instruction), a single unified model for all video tasks, competitive results across all benchmarks.
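The three-stage schedule can be captured as a config sketch. The field names below are mine; the step counts, sample counts, and frozen/trainable splits are taken from the figure summary above:

```python
# Three-stage training schedule as a config sketch (field names are
# illustrative; numbers come from the paper's overview figure).
STAGES = [
    {"name": "connector_alignment",
     "trainable": ["connector"], "frozen": ["mllm", "mmdit"],
     "data": "40M T2I + 10M T2V", "steps": 15_000},
    {"name": "fine_tuning",
     "trainable": ["connector", "mmdit"], "frozen": ["mllm"],
     "data": "10K high-quality T2I/T2V", "steps": 5_000, "ema": 0.9999},
    {"name": "multi_task",
     "trainable": ["connector", "mmdit"], "frozen": ["mllm"],
     "data": "all tasks, mixed sampling", "steps": 15_000},
]

# The MLLM stays frozen throughout: understanding is never fine-tuned away.
assert all("mllm" in s["frozen"] for s in STAGES)
```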
Q1
1. What is the key architectural innovation of UniVideo that enables it to handle both understanding and generation tasks?
A single stream architecture with multiple task-specific modules
A dual-stream design combining MLLM for understanding and MMDiT for generation
A transformer-based architecture with learnable query tokens
Q2
2. What unique capability does UniVideo demonstrate in terms of generalization?
It can only perform tasks it was explicitly trained on
It can generate high-resolution videos but cannot edit them
It can perform free-form video editing despite not being trained on such tasks
Q3
3. How does UniVideo handle video editing differently from existing methods?
It requires explicit mask inputs like all other video editing models
It only works with pre-defined editing templates
It can edit videos based on natural language instructions without requiring masks

Paper 3

DeepPrune: Parallel Scaling without Inter-trace Redundancy

Published: 2025-10-09

Link: http://arxiv.org/pdf/2510.08483

1. 📘 Topic and Domain: The paper focuses on efficient parallel scaling for large language models' reasoning capabilities through dynamic pruning of redundant reasoning traces.
2. 💡 Previous Research and New Ideas: Building on parallel scaling methods that generate multiple Chain-of-Thought traces simultaneously, the paper proposes DeepPrune, a framework that removes computational redundancy across traces while preserving answer diversity.
3. ❓ Problem: The paper addresses the inefficiency in parallel reasoning where over 80% of computational resources are wasted on generating equivalent reasoning paths that lead to identical answers.
4. 🛠️ Methods: The authors developed a specialized judge model trained with focal loss and oversampling techniques to predict answer equivalence from partial reasoning traces, combined with an online greedy clustering algorithm for dynamic pruning.
5. 📊 Results and Evaluation: DeepPrune cuts token usage by over 80% compared to conventional consensus sampling while keeping accuracy within 3 percentage points, and its judge model reaches 0.87 AUROC on answer-equivalence prediction.
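The judge model's training combats class imbalance with focal loss plus oversampling of the minority class. A minimal numpy sketch; gamma and alpha below are the common focal-loss defaults, not necessarily the paper's settings:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples so the judge
    concentrates on hard equivalent/non-equivalent trace pairs."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)          # prob assigned to true class
    at = np.where(y == 1, alpha, 1 - alpha)  # class balancing weight
    return -at * (1 - pt) ** gamma * np.log(pt)

# A confident correct prediction is penalised far less than a hard one:
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.55]), np.array([1]))
assert easy.item() < hard.item()

# Oversampling: replicate minority-class pairs until batches balance.
ys = np.array([1, 1, 1, 1, 0])               # imbalanced labels
minority = np.where(ys == 0)[0]
balanced = np.concatenate([ys, np.repeat(ys[minority], 3)])
assert (balanced == 0).sum() == (balanced == 1).sum()
```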


Workflow (from the paper's overview figure):

- Problem analysis: identify inter-trace redundancy (80%+ of traces).
- Offline training: collect trace pairs; train the judge model with focal loss and oversampling.
- Online pruning: greedy clustering with dynamic pruning under a similarity threshold τ.
- Final answer: majority voting over the retained traces, with 80%+ token reduction.
- Truncation strategies: fixed-length prefix; reasoning-step alignment.
- Judge performance: AUROC 0.87; TNR@0.2 of 0.82.
- Evaluation: AIME 2024/2025 and GPQA, on DeepSeek-8B, Qwen3-32B, and GPT-OSS-20B.
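The online stage can be sketched as greedy clustering: a new trace is kept only if the judge scores it unlikely to match every retained cluster representative; otherwise it is pruned early and its remaining generation budget is saved. A toy version, with a hypothetical stand-in judge in place of the trained model:

```python
def greedy_prune(traces, judge, tau=0.5):
    """Online greedy clustering: keep a trace only if `judge(a, b)` --
    the probability two partial traces reach the same answer -- stays
    below threshold tau against every retained representative."""
    kept = []
    for t in traces:
        if all(judge(t, rep) < tau for rep in kept):
            kept.append(t)      # new cluster: keep generating this trace
        # else: prune t early and save its remaining tokens
    return kept

# Toy judge: traces are "equivalent" when they share a final answer tag.
toy_judge = lambda a, b: 1.0 if a[0] == b[0] else 0.0
traces = [("42", "path A"), ("42", "path B"), ("17", "path C")]
kept = greedy_prune(traces, toy_judge)
assert [t[0] for t in kept] == ["42", "17"]   # redundant trace pruned
```

Because clustering is greedy and online, the decision is made from partial traces, which is where the bulk of the 80%+ token savings comes from.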
Q1
1. What is the main efficiency problem that DeepPrune aims to solve in parallel reasoning?
High computational costs from using too many tokens
Over 80% of parallel reasoning traces yielding identical answers
Slow processing speed of language models
Q2
2. How does DeepPrune's judge model handle the class imbalance problem in training data?
By using data augmentation techniques
By discarding excess majority class samples
By combining focal loss with oversampling techniques
Q3
3. What was the most significant token reduction achieved by DeepPrune while maintaining accuracy?
Up to 91.6% on AIME25 dataset
Around 50% across all datasets
Up to 75% on GPQA dataset