2026-01-21 Papers


Paper 1

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.14250

1. 📘 Topic and Domain: The paper presents OmniTransfer, a unified framework for spatio-temporal video transfer that handles both appearance (ID, style) and temporal (motion, camera movement, effects) transfer tasks in video generation.
2. 💡 Previous Research and New Ideas: The paper builds on diffusion-based video generation methods (Wan2.1 I2V) and proposes leveraging multi-frame reference videos instead of single reference images, based on the novel assumption that video diffusion models are inherently capable of maintaining temporal consistency through spatial context.
3. ❓ Problem: The paper addresses the limitation that existing video customization methods rely on single reference images or task-specific priors, failing to fully exploit the rich spatio-temporal information inherent in reference videos.
4. 🛠️ Methods: The authors propose three key components: Task-aware Positional Bias (applying spatial/temporal offsets to RoPE), Reference-decoupled Causal Learning (separating reference and target branches), and Task-adaptive Multimodal Alignment (using MLLM with task-specific MetaQueries).
5. 📊 Results and Evaluation: OmniTransfer outperforms existing methods in appearance and temporal transfer tasks across multiple metrics (VSim scores, user studies) while achieving 20% faster inference than baseline architectures, with evaluations on curated test sets of 50-100 videos per task.
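The Task-aware Positional Bias idea, shifting the RoPE position indices of reference tokens so they never collide with target tokens, can be sketched in a few lines. This is a hypothetical 1-D illustration, not the authors' code; the function names, the task labels, and the offset value of 512 are assumptions:

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard inverse frequencies for rotary position embeddings (RoPE)."""
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def biased_positions(n_tokens: int, task: str, offset: float = 512.0) -> np.ndarray:
    """Shift reference-token positions before computing RoPE angles.

    Hypothetical sketch of Task-aware Positional Bias: temporal tasks
    (motion/camera/effect transfer) offset the spatial axis, while
    appearance tasks would offset the temporal axis instead.
    Collapsed to a single 1-D axis here for illustration.
    """
    pos = np.arange(n_tokens, dtype=float)
    if task in ("motion", "camera", "effect"):  # temporal tasks
        return pos + offset                     # spatial RoPE offset
    return pos  # appearance tasks: offset would go on the temporal axis

def rope_angles(pos: np.ndarray, dim: int) -> np.ndarray:
    """Rotation angle for each (position, frequency) pair."""
    return np.outer(pos, rope_frequencies(dim))
```

With distinct offsets, reference and target tokens occupy disjoint position ranges, so attention can tell them apart without extra learned embeddings.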

[Figure] OmniTransfer framework workflow: a reference video (spatial + temporal info), an optional target first frame, and an optional text prompt feed into Reference Latent Construction (VAE encoder with task flags). Task-aware Positional Bias applies a spatial RoPE offset for temporal tasks and a temporal RoPE offset for appearance tasks. Reference-decoupled Causal Learning runs the reference branch with self-attention only at fixed t=0 in a single forward pass (20% faster), while the time-dependent target branch uses cross-attention for causal transfer. Task-adaptive Multimodal Alignment pairs Qwen-2.5-VL + LoRA with task-specific MetaQueries and a 3-layer MLP connector, yielding the generated target video. Key features: multi-view reference, unified framework, efficient processing, task adaptation.
Q1. What key assumption does OmniTransfer make about video diffusion models that differs from previous approaches?
A. Video diffusion models require explicit pose priors to maintain temporal consistency
B. Video diffusion models are inherently capable of handling temporal consistency through spatial context
C. Video diffusion models can only process single-frame references effectively

Q2. How does Reference-decoupled Causal Learning address the "copy-paste" problem in video transfer?
A. By using bidirectional attention between reference and target branches with full context sharing
B. By applying temporal masking to prevent direct copying of reference frames
C. By implementing unidirectional transfer where the reference branch cannot access target context

Q3. What computational advantage does OmniTransfer achieve compared to standard architectures?
A. A 20% reduction in inference time by making the reference branch time-invariant
B. 50% fewer parameters through aggressive model pruning
C. 4x faster training by using smaller batch sizes

Paper 2

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.13836

1. 📘 Topic and Domain: The paper introduces FutureOmni, a benchmark for evaluating multimodal large language models' ability to forecast future events from audio-visual contexts in video understanding.
2. 💡 Previous Research and New Ideas: Building on existing multimodal benchmarks that focus on retrospective understanding, the paper proposes the first benchmark specifically designed for omni-modal future forecasting with cross-modal causal reasoning from audio-visual environments.
3. ❓ Problem: Current multimodal LLMs lack evaluation for their ability to predict future events from audio-visual cues, as existing benchmarks primarily assess retrospective comprehension rather than forward-looking reasoning.
4. 🛠️ Methods: The authors construct FutureOmni through a scalable LLM-assisted, human-in-the-loop pipeline, creating 919 videos with 1,034 multiple-choice QA pairs, and propose an Omni-Modal Future Forecasting (OFF) training strategy with a 7K-sample instruction-tuning dataset.
5. 📊 Results and Evaluation: Evaluation of 20 MLLMs shows that current models struggle with audio-visual future prediction (best accuracy: 64.8% by Gemini 3 Flash), while the proposed OFF method improves future forecasting performance and generalization across multiple benchmarks.
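Scoring such a multiple-choice benchmark reduces to exact-match accuracy over the selected option letters. A minimal sketch of such a harness (hypothetical, not the authors' evaluation code):

```python
def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of QA pairs where the model's chosen option matches the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("prediction/key length mismatch")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Toy run; FutureOmni's 1,034 QA pairs would be scored the same way.
preds = ["B", "A", "C", "B"]
gold  = ["B", "C", "C", "B"]
print(f"{accuracy(preds, gold):.3f}")  # prints 0.750
```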

[Figure] FutureOmni construction pipeline: audio-coordinated video selection (dynamic-scene filter over a YouTube domain list) yields 9K filtered videos; audio-visual temporal localization and calibration (time-boundary checks, MFCCs) produces initial captions with timestamps and audio-fulfilled captions; audio-visual OFP QA construction (grounding detection, distractor design) generates forecasting pairs with rationales; dual-stage verification (GPT-4o check plus human review) yields the final FutureOmni benchmark of 919 videos and 1,034 QA pairs.
Q1. What type of adversarial distractors does FutureOmni use to prevent shortcut learning in MLLMs?
A. Visual-only perception, audio-only perception, delayed events, and reverse-causal options
B. Random noise injection, temporal shuffling, modality dropout, and semantic perturbation
C. Object substitution, scene inversion, audio mismatch, and temporal compression

Q2. According to the error analysis of Gemini 3 Flash's failures on FutureOmni, what was identified as the primary bottleneck?
A. Lack of world knowledge, accounting for 51.6% of errors
B. Video perception errors, accounting for 51.6% of errors
C. Audio perception errors, accounting for 51.6% of errors

Q3. What phenomenon did the researchers observe regarding model performance across different video durations?
A. Models performed best on the longest videos due to more contextual information
B. Models showed consistent performance regardless of video duration
C. Models struggled most on the shortest videos, with performance peaking at medium durations of [2, 4) minutes

Paper 3

Think3D: Thinking with Space for Spatial Reasoning

Published: 2026-01-19

Link: http://arxiv.org/pdf/2601.13029

1. 📘 Topic and Domain: The paper addresses spatial reasoning in vision-language models (VLMs) by enabling them to interact with 3D reconstructed environments rather than relying solely on 2D image perception.
2. 💡 Previous Research and New Ideas: Building on prior "think with image" approaches that use 2D tools (zoom, crop, depth estimation), the paper introduces "think with space" - allowing VLMs to actively manipulate 3D point clouds reconstructed from multi-view images, using camera poses as spatial anchors for coherent 3D exploration.
3. ❓ Problem: Current VLMs struggle with genuine 3D reasoning tasks (multi-view understanding, route planning) because they remain fundamentally 2D perceivers, unable to build consistent 3D representations needed for spatial intelligence.
4. 🛠️ Methods: Think3D uses a 3D manipulation toolkit (Pi3 for reconstruction, camera-based transformations, novel view rendering) enabling iterative observe→manipulate→reflect loops, plus Think3D-RL which trains smaller models via reinforcement learning (GRPO) to learn effective viewpoint selection strategies.
5. 📊 Results and Evaluation: On BLINK Multi-view and MindCube, Think3D achieves a +7.8% average gain for GPT-4.1/Gemini-2.5-Pro and +4.7% on VSI-Bench; with RL training, smaller models' benefit from spatial exploration grows from +0.7% to +6.8%, demonstrating that learned exploration policies significantly enhance 3D reasoning capabilities.
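The observe → manipulate → reflect loop behind Think3D can be sketched as a small agent driver. This is a hypothetical control-flow illustration only: `vlm` and `render` stand in for the real model and the novel-view renderer, and the step/action dictionary format is invented, not the paper's interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpatialAgent:
    """Iterative observe -> manipulate -> reflect loop (control flow only)."""
    vlm: Callable        # (query, images, history) -> {"type": "answer"|"tool", ...}
    render: Callable     # (point_cloud, action) -> rendered novel view
    max_steps: int = 5   # tool-call budget

    def run(self, query, images, point_cloud):
        history = []
        for _ in range(self.max_steps):
            step = self.vlm(query, images, history)               # observe + decide
            if step["type"] == "answer":                          # reflect: confident
                return step["value"]
            new_view = self.render(point_cloud, step["action"])   # manipulate the 3D scene
            images = images + [new_view]                          # feed the view back in
            history.append(step)
        return None  # budget exhausted without an answer
```

Think3D-RL would then optimize the policy that picks `step["action"]` (camera selection, rotation, view mode) with GRPO, rewarding trajectories that end in a correct answer.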

[Figure] Think3D framework: multi-view images {I_t}_{t=1}^T and a query q feed a Pi3-based 3D reconstruction that produces a point cloud X and camera poses C_t. A VLM agent performs spatial reasoning via tool calling π_θ(q, {I_t}, H_{k-1}) over a 3D manipulation toolkit (camera selection, 3D rotation (Δα, Δβ), global/ego view modes, novel-view rendering Î_k), iterating an observe → manipulate → reflect 3D chain of thought. Think3D-RL adds a GRPO-trained reinforcement learning module to learn the action-selection policy a_k, yielding the spatial answer ŷ.
Q1. What key innovation enables Think3D to maintain coherent spatial reasoning when manipulating 3D point clouds?
A. Using camera poses as anchors to provide stable reference frames for 3D transformations
B. Training on massive 3D datasets with diverse spatial configurations
C. Implementing advanced depth estimation algorithms for 2.5D perception

Q2. According to the experimental results, what happens to smaller VLMs after Think3D-RL training?
A. They completely match the performance of GPT-4.1 on all spatial tasks
B. Their exploration patterns align more closely with those of stronger models, improving tool-usage benefits from +0.7% to +6.8%
C. They learn to avoid using 3D tools entirely and rely on 2D reasoning

Q3. What observation did the authors make about task-specific exploration strategies in their ablation studies?
A. All tasks require identical viewpoint distributions for optimal performance
B. Route-planning tasks prefer top-down views, while object-orientation tasks rely more on rotational viewpoints
C. Ego-centric views are universally superior to global views across all spatial reasoning tasks