2026-01-21 Papers


Paper 1

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.14250

1. 📘 Topic and Domain: The paper presents OmniTransfer, a unified framework for spatio-temporal video transfer that handles both appearance (ID, style) and temporal (motion, camera movement, effects) transfer tasks in video generation.
2. 💡 Previous Research and New Ideas: The paper builds on diffusion-based video generation methods (Wan2.1 I2V) and proposes leveraging multi-frame reference videos instead of single reference images, based on the novel assumption that video diffusion models are inherently capable of maintaining temporal consistency through spatial context.
3. ❓ Problem: The paper addresses the limitation that existing video customization methods rely on single reference images or task-specific priors, failing to fully exploit the rich spatio-temporal information inherent in reference videos.
4. 🛠️ Methods: The authors propose three key components: Task-aware Positional Bias (applying spatial/temporal offsets to RoPE), Reference-decoupled Causal Learning (separating reference and target branches), and Task-adaptive Multimodal Alignment (using MLLM with task-specific MetaQueries).
5. 📊 Results and Evaluation: OmniTransfer outperforms existing methods in appearance and temporal transfer tasks across multiple metrics (VSim scores, user studies) while achieving 20% faster inference than baseline architectures, with evaluations on curated test sets of 50-100 videos per task.
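The Task-aware Positional Bias idea, shifting the RoPE position indices of reference tokens so they never collide with target tokens, can be sketched in a few lines. This is a hypothetical 1-D illustration, not the authors' code; the function names, the task labels, and the offset value of 512 are assumptions:

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard inverse frequencies for rotary position embeddings (RoPE)."""
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def biased_positions(n_tokens: int, task: str, offset: float = 512.0) -> np.ndarray:
    """Shift reference-token positions before computing RoPE angles.

    Hypothetical sketch of Task-aware Positional Bias: temporal tasks
    (motion/camera/effect transfer) offset the spatial axis, while
    appearance tasks would offset the temporal axis instead.
    Collapsed to a single 1-D axis here for illustration.
    """
    pos = np.arange(n_tokens, dtype=float)
    if task in ("motion", "camera", "effect"):  # temporal tasks
        return pos + offset                     # spatial RoPE offset
    return pos  # appearance tasks: offset would go on the temporal axis

def rope_angles(pos: np.ndarray, dim: int) -> np.ndarray:
    """Rotation angle for each (position, frequency) pair."""
    return np.outer(pos, rope_frequencies(dim))
```

With distinct offsets, reference and target tokens occupy disjoint position ranges, so attention can tell them apart without extra learned embeddings.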

[Figure] OmniTransfer framework workflow: a reference video (spatial + temporal info), an optional target first frame, and an optional text prompt feed into Reference Latent Construction (VAE encoder with task flags). Task-aware Positional Bias applies a spatial RoPE offset for temporal tasks and a temporal RoPE offset for appearance tasks. Reference-decoupled Causal Learning runs the reference branch with self-attention only at fixed t=0 in a single forward pass (20% faster), while the time-dependent target branch uses cross-attention for causal transfer. Task-adaptive Multimodal Alignment pairs Qwen-2.5-VL + LoRA with task-specific MetaQueries and a 3-layer MLP connector, yielding the generated target video. Key features: multi-view reference, unified framework, efficient processing, task adaptation.
Q1. What key assumption does OmniTransfer make about video diffusion models that differs from previous approaches?
A. Video diffusion models require explicit pose priors to maintain temporal consistency
B. Video diffusion models are inherently capable of handling temporal consistency through spatial context
C. Video diffusion models can only process single-frame references effectively

Q2. How does Reference-decoupled Causal Learning address the "copy-paste" problem in video transfer?
A. By using bidirectional attention between reference and target branches with full context sharing
B. By applying temporal masking to prevent direct copying of reference frames
C. By implementing unidirectional transfer where the reference branch cannot access target context

Q3. What computational advantage does OmniTransfer achieve compared to standard architectures?
A. A 20% reduction in inference time by making the reference branch time-invariant
B. 50% fewer parameters through aggressive model pruning
C. 4x faster training by using smaller batch sizes

Paper 2

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Published: 2026-01-20

Link: http://arxiv.org/pdf/2601.13836

1. 📘 Topic and Domain: The paper introduces FutureOmni, a benchmark for evaluating multimodal large language models' ability to forecast future events from audio-visual contexts in video understanding.
2. 💡 Previous Research and New Ideas: Building on existing multimodal benchmarks that focus on retrospective understanding, the paper proposes the first benchmark specifically designed for omni-modal future forecasting with cross-modal causal reasoning from audio-visual environments.
3. ❓ Problem: Current multimodal LLMs lack evaluation for their ability to predict future events from audio-visual cues, as existing benchmarks primarily assess retrospective comprehension rather than forward-looking reasoning.
4. 🛠️ Methods: The authors construct FutureOmni through a scalable LLM-assisted, human-in-the-loop pipeline, creating 919 videos with 1,034 multiple-choice QA pairs, and propose an Omni-Modal Future Forecasting (OFF) training strategy with a 7K-sample instruction-tuning dataset.
5. 📊 Results and Evaluation: Evaluation of 20 MLLMs shows that current models struggle with audio-visual future prediction (best accuracy: 64.8% by Gemini 3 Flash), while the proposed OFF method improves future forecasting performance and generalization across multiple benchmarks.
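Scoring such a multiple-choice benchmark reduces to exact-match accuracy over the selected option letters. A minimal sketch of such a harness (hypothetical, not the authors' evaluation code):

```python
def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of QA pairs where the model's chosen option matches the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("prediction/key length mismatch")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Toy run; FutureOmni's 1,034 QA pairs would be scored the same way.
preds = ["B", "A", "C", "B"]
gold  = ["B", "C", "C", "B"]
print(f"{accuracy(preds, gold):.3f}")  # prints 0.750
```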

[Figure] FutureOmni construction pipeline: audio-coordinated video selection (dynamic-scene filter over a YouTube domain list) yields 9K filtered videos; audio-visual temporal localization and calibration (time-boundary checks, MFCCs) produces initial captions with timestamps and audio-fulfilled captions; audio-visual OFP QA construction (grounding detection, distractor design) generates forecasting pairs with rationales; dual-stage verification (GPT-4o check plus human review) yields the final FutureOmni benchmark of 919 videos and 1,034 QA pairs.
Q1. What type of adversarial distractors does FutureOmni use to prevent shortcut learning in MLLMs?
A. Visual-only perception, audio-only perception, delayed events, and reverse-causal options
B. Random noise injection, temporal shuffling, modality dropout, and semantic perturbation
C. Object substitution, scene inversion, audio mismatch, and temporal compression

Q2. According to the error analysis of Gemini 3 Flash's failures on FutureOmni, what was identified as the primary bottleneck?
A. Lack of world knowledge, accounting for 51.6% of errors
B. Video perception errors, accounting for 51.6% of errors
C. Audio perception errors, accounting for 51.6% of errors

Q3. What phenomenon did the researchers observe regarding model performance across different video durations?
A. Models performed best on the longest videos due to more contextual information
B. Models showed consistent performance regardless of video duration
C. Models struggled most on the shortest videos, with performance peaking at medium durations of [2, 4) minutes

Paper 3

Think3D: Thinking with Space for Spatial Reasoning

Published: 2026-01-19

Link: http://arxiv.org/pdf/2601.13029

1. 📘 Topic and Domain: The paper addresses spatial reasoning in vision-language models (VLMs) by enabling them to interact with 3D reconstructed environments rather than relying solely on 2D image perception.
2. 💡 Previous Research and New Ideas: Building on prior "think with image" approaches that use 2D tools (zoom, crop, depth estimation), the paper introduces "think with space" - allowing VLMs to actively manipulate 3D point clouds reconstructed from multi-view images, using camera poses as spatial anchors for coherent 3D exploration.
3. ❓ Problem: Current VLMs struggle with genuine 3D reasoning tasks (multi-view understanding, route planning) because they remain fundamentally 2D perceivers, unable to build consistent 3D representations needed for spatial intelligence.
4. 🛠️ Methods: Think3D uses a 3D manipulation toolkit (Pi3 for reconstruction, camera-based transformations, novel view rendering) enabling iterative observe→manipulate→reflect loops, plus Think3D-RL which trains smaller models via reinforcement learning (GRPO) to learn effective viewpoint selection strategies.
5. 📊 Results and Evaluation: On BLINK Multi-view and MindCube, Think3D achieves a +7.8% average gain for GPT-4.1/Gemini-2.5-Pro and +4.7% on VSI-Bench; with RL training, smaller models' benefit from spatial exploration grows from +0.7% to +6.8%, demonstrating that learned exploration policies significantly enhance 3D reasoning capabilities.
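The observe → manipulate → reflect loop behind Think3D can be sketched as a small agent driver. This is a hypothetical control-flow illustration only: `vlm` and `render` stand in for the real model and the novel-view renderer, and the step/action dictionary format is invented, not the paper's interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpatialAgent:
    """Iterative observe -> manipulate -> reflect loop (control flow only)."""
    vlm: Callable        # (query, images, history) -> {"type": "answer"|"tool", ...}
    render: Callable     # (point_cloud, action) -> rendered novel view
    max_steps: int = 5   # tool-call budget

    def run(self, query, images, point_cloud):
        history = []
        for _ in range(self.max_steps):
            step = self.vlm(query, images, history)               # observe + decide
            if step["type"] == "answer":                          # reflect: confident
                return step["value"]
            new_view = self.render(point_cloud, step["action"])   # manipulate the 3D scene
            images = images + [new_view]                          # feed the view back in
            history.append(step)
        return None  # budget exhausted without an answer
```

Think3D-RL would then optimize the policy that picks `step["action"]` (camera selection, rotation, view mode) with GRPO, rewarding trajectories that end in a correct answer.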

[Figure] Think3D framework: multi-view images {I_t}_{t=1}^T and a query q feed a Pi3-based 3D reconstruction that produces a point cloud X and camera poses C_t. A VLM agent performs spatial reasoning via tool calling π_θ(q, {I_t}, H_{k-1}) over a 3D manipulation toolkit (camera selection, 3D rotation (Δα, Δβ), global/ego view modes, novel-view rendering Î_k), iterating an observe → manipulate → reflect 3D chain of thought. Think3D-RL adds a GRPO-trained reinforcement learning module to learn the action-selection policy a_k, yielding the spatial answer ŷ.
Q1. What key innovation enables Think3D to maintain coherent spatial reasoning when manipulating 3D point clouds?
A. Using camera poses as anchors to provide stable reference frames for 3D transformations
B. Training on massive 3D datasets with diverse spatial configurations
C. Implementing advanced depth estimation algorithms for 2.5D perception

Q2. According to the experimental results, what happens to smaller VLMs after Think3D-RL training?
A. They completely match the performance of GPT-4.1 on all spatial tasks
B. Their exploration patterns align more closely with those of stronger models, improving tool-usage benefits from +0.7% to +6.8%
C. They learn to avoid using 3D tools entirely and rely on 2D reasoning

Q3. What observation did the authors make about task-specific exploration strategies in their ablation studies?
A. All tasks require identical viewpoint distributions for optimal performance
B. Route-planning tasks prefer top-down views, while object-orientation tasks rely more on rotational viewpoints
C. Ego-centric views are universally superior to global views across all spatial reasoning tasks