2025-10-24 Papers


Paper 1

Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Published: 2025-10-23

Link: http://arxiv.org/pdf/2510.20579

1. 📘 Topic and Domain: Video reasoning with explicit spatio-temporal evidence grounding, in the domain of multimodal AI and computer vision.
2. 💡 Previous Research and New Ideas: Based on OpenAI-o3's evidence-centered reasoning for images, proposes extending this to videos by integrating explicit spatio-temporal evidence into video reasoning.
3. ❓ Problem: Existing video reasoning models only generate textual reasoning without indicating when and where key evidence appears, making it difficult to verify their reasoning.
4. 🛠️ Methods: Uses a two-stage approach: (1) curates high-quality spatio-temporal training data, and (2) applies reinforcement learning with adaptive temporal proximity and temporal gating mechanisms to optimize temporal and spatial grounding.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on V-STAR benchmark, improving mAM by 14.4% and mLGM by 24.2% over baseline Qwen2.5-VL, with consistent gains on other video understanding benchmarks.

Figure: Open-o3 Video grounded video reasoning workflow
- Data construction: temporal-grounding sources, PLM-Rdcap sources, and 5.9k manual annotations yield STGR-CoT-30k and STGR-RL-36k; the annotation pipeline uses Gemini 2.5 Pro for initial annotation, followed by bounding-box filtering and self-consistency checking.
- Cold-start SFT: the Qwen2.5-VL-7B base model is fine-tuned on STGR-CoT-30k for 1 epoch at a learning rate of 1e-6.
- Reinforcement learning (GSPO) with three rewards: accuracy r_acc (MCQ/ROUGE/IoU), temporal r_t with adaptive proximity, and spatial r_s with temporal gating. Key innovations: adaptive temporal proximity (σ annealed from 4 to 1) and temporal gating (τ = 3 s threshold).
- Spatio-temporal output: reasoning traces interleave objects, boxes, and timestamps, e.g. <think> The video shows <obj>woman</obj> <box>[374,67,420,224]</box> at <t>9.2</t>s wearing a red vest... </think> <answer>Red vest over white shirt</answer>.
- Evaluation: V-STAR +14.4% mAM, VideoMME +1.2% overall, WorldSense +1.4% overall, TVGBench +4.5 mIoU.
- Test-time scaling: generate N=8 responses and apply confidence-aware voting with evidence, adding +1.0% on VideoMMMU and +1.2% on WorldSense.
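A minimal sketch of the two reward mechanisms named above, adaptive temporal proximity and temporal gating. The Gaussian form of the proximity reward is an assumption of this sketch; the σ annealing schedule (4 -> 1) and the τ = 3 s gate are taken from the paper's figure.

```python
import math

def temporal_reward(t_pred, t_gt, sigma):
    """Adaptive temporal proximity reward: Gaussian credit for a
    predicted timestamp near the ground truth. sigma is annealed during
    training (the figure shows 4 -> 1), so early training tolerates
    coarse timestamps and later training demands precise ones."""
    return math.exp(-((t_pred - t_gt) ** 2) / (2 * sigma ** 2))

def gated_spatial_reward(iou, t_pred, t_gt, tau=3.0):
    """Temporal gating: the spatial (box IoU) reward counts only when
    the predicted timestamp is within tau seconds of the ground truth,
    so a good box attached to the wrong moment earns nothing."""
    return iou if abs(t_pred - t_gt) <= tau else 0.0

# Early in training (sigma = 4) a 1.8 s timestamp error still earns
# most of the temporal reward, and the box lies within the 3 s gate.
r_t = temporal_reward(t_pred=11.0, t_gt=9.2, sigma=4.0)   # ~0.90
r_s = gated_spatial_reward(iou=0.7, t_pred=11.0, t_gt=9.2)  # 0.7
```

Annealing σ makes the temporal signal dense at the start and sharp at the end, while the gate keeps spatial credit from leaking onto mistimed evidence.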
Q1. What is the main innovation of Open-o3 Video compared to previous video reasoning models?
a) It uses more advanced language models for reasoning
b) It incorporates explicit spatio-temporal evidence in its reasoning process
c) It processes videos at a much faster speed

Q2. Which training strategy does Open-o3 Video use to improve its performance?
a) Single-stage end-to-end training
b) Unsupervised pre-training followed by fine-tuning
c) Two-stage approach with supervised fine-tuning and reinforcement learning

Q3. What was the improvement in mAM (mean Arithmetic Mean) achieved by Open-o3 Video over the Qwen2.5-VL baseline?
a) 5.2%
b) 14.4%
c) 24.2%

Paper 2

HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Published: 2025-10-23

Link: http://arxiv.org/pdf/2510.20822

1. 📘 Topic and Domain: Text-to-video generation focusing on creating coherent, multi-shot cinematic narratives using AI.
2. 💡 Previous Research and New Ideas: Based on previous single-shot video generation models and diffusion transformers, proposing a novel holistic approach that generates entire scenes in one pass rather than sequential or chunk-based generation.
3. ❓ Problem: Addresses the "narrative gap" between current AI's ability to generate isolated video clips and the ability to create coherent, multi-shot narratives that maintain consistency across scenes.
4. 🛠️ Methods: Developed HoloCine framework with two key mechanisms: Window Cross-Attention for precise directorial control and Sparse Inter-Shot Self-Attention for efficient computation across long videos.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance in narrative coherence, transition control, and character consistency, while demonstrating emergent capabilities like persistent memory for characters and intuitive grasp of cinematic techniques.
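Window Cross-Attention can be pictured as a cross-attention mask that routes each shot's caption only to that shot's frames. The concatenated layout below (an optional global caption followed by per-shot captions) and the helper name are assumptions of this sketch, not HoloCine's actual token layout.

```python
import numpy as np

def window_cross_attention_mask(shot_frame_counts, shot_text_lens, global_text_len=0):
    """Boolean cross-attention mask: the video tokens (queries) of shot i
    may read only shot i's caption tokens (keys), plus an optional global
    scene caption visible to every shot."""
    rows = sum(shot_frame_counts)                  # video tokens
    cols = global_text_len + sum(shot_text_lens)   # text tokens
    mask = np.zeros((rows, cols), dtype=bool)
    mask[:, :global_text_len] = True               # global caption: visible everywhere
    r, c = 0, global_text_len
    for f, t in zip(shot_frame_counts, shot_text_lens):
        mask[r:r + f, c:c + t] = True              # shot i frames <- shot i caption only
        r += f
        c += t
    return mask

# Two shots (2 and 3 frame tokens) with captions of 4 and 5 tokens,
# plus a 2-token global caption.
m = window_cross_attention_mask([2, 3], [4, 5], global_text_len=2)
```

Because each caption window touches only its own shot's frames, editing one shot's prompt changes that shot alone, which is the per-shot directorial control the summary describes.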

Figure: HoloCine holistic multi-shot generation pipeline
- Data curation: shot segmentation, multi-shot assembly, and hierarchical captioning; 400k samples captioned with Gemini 2.5.
- Architecture: Window Cross-Attention localizes text prompts to specific shots for precise directorial control; Sparse Inter-Shot Self-Attention is dense within shots and sparse between shots, reducing cost from O(N·L²) to O(N·L).
- Holistic generation: all shots are processed simultaneously in a single diffusion pass, built on a DiT-based video diffusion model (Wan2.2 14B) with joint self-attention for global consistency.
- Training setup: 10k steps on 128 NVIDIA H800 GPUs at 480×832 resolution, up to 13 shots per video, using FSDP and context parallelism.
- Evaluation: shot cut accuracy (SCA), inter-shot consistency, intra-shot consistency, semantic consistency, aesthetic quality, the VBench benchmark, and 100 diverse test prompts.
- Emergent capabilities: persistent memory, character consistency, long-range re-appearance, fine-grained detail persistence, and cinematic language control (shot scale, angle, movement).
- Comparisons: state-of-the-art results vs. Wan2.2 (pre-trained), StoryDiffusion+Wan2.2, IC-LoRA+Wan2.2, CineTrans, and commercial models (Vidu, Kling, Sora2).
- Ablations: without Window Cross-Attention, lower SCA and poor shot control; full vs. sparse attention, similar quality but much faster; without the summary component, loss of consistency and character drift. Stated limitation: causal reasoning, as the model prioritizes visual consistency over logical consequences.
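The dense-within, sparse-between attention pattern can be illustrated as a boolean self-attention mask. Treating the first token(s) of each shot as the only cross-shot links is this sketch's assumption about how the O(N·L²) -> O(N·L) sparsification might be realized; the paper's actual sparsity pattern may differ.

```python
import numpy as np

def sparse_inter_shot_mask(shot_lengths, summary_tokens=1):
    """Self-attention mask that is dense within each shot (every token
    sees its whole shot) and sparse across shots (only the first
    `summary_tokens` tokens of each shot are visible globally)."""
    total = sum(shot_lengths)
    mask = np.zeros((total, total), dtype=bool)
    starts = np.cumsum([0] + shot_lengths[:-1])
    for s, L in zip(starts, shot_lengths):
        mask[s:s + L, s:s + L] = True          # dense block within the shot
        mask[:, s:s + summary_tokens] = True   # cross-shot link via summary tokens
    return mask

# Three shots of 4 tokens each: full attention inside each shot,
# cross-shot attention only through each shot's leading summary token.
mask = sparse_inter_shot_mask([4, 4, 4])
```

With N shots of L tokens, the allowed entries grow as N·L² within shots plus O(N²·L) summary links, instead of the (N·L)² of full attention; per token the cost drops from N·L to roughly L, matching the figure's O(N·L²) -> O(N·L) claim.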
Q1. What is the main technical innovation that allows HoloCine to maintain efficiency when generating long videos?
a) Window Cross-Attention mechanism
b) Sparse Inter-Shot Self-Attention pattern
c) Hierarchical prompt structure

Q2. What unexpected capability emerged from HoloCine that wasn't explicitly designed for?
a) The ability to generate special effects
b) The ability to create original music scores
c) Persistent memory for characters and scenes across shots

Q3. What fundamental problem in AI video generation does HoloCine address?
a) The "narrative gap" between single clips and coherent multi-shot stories
b) The inability to generate high-resolution videos
c) The lack of realistic audio generation

Paper 3

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

Published: 2025-10-22

Link: http://arxiv.org/pdf/2510.19779

1. 📘 Topic and Domain: Efficient speculative decoding for large language models through selective knowledge distillation.
2. 💡 Previous Research and New Ideas: Based on conventional knowledge distillation and speculative decoding methods, proposes a novel selective token filtering approach for more efficient knowledge transfer.
3. ❓ Problem: Addresses the inefficiency in traditional knowledge distillation methods where draft models struggle to fully assimilate target model knowledge due to capacity constraints.
4. 🛠️ Methods: Introduces AdaSPEC, a two-phase approach: first distills the target model into a reference model to identify hard-to-fit tokens, then selectively distills the target into the draft model on the remaining, easier tokens.
5. 📊 Results and Evaluation: Consistently outperformed state-of-the-art DistillSpec method across diverse tasks (arithmetic, instruction-following, coding, summarization), achieving up to 15% higher acceptance rates with model configurations of 31M/1.4B and 350M/2.7B parameters.

Figure: AdaSPEC selective knowledge distillation workflow
- Phase 1, reference model construction: the fine-tuned target model Mp is distilled via KL divergence into a reference model Mref on dataset D.
- Phase 2, token filtering: compute per-token losses L_ref(w) = KL(P||R) and L_draft(w) = KL(P||Q), take ΔL(w) = L_draft(w) - L_ref(w), and select the top k% of tokens by ΔL(w) as the filtered token set S.
- Phase 3, selective draft-model distillation: train the initial draft model Mq with L_distill = (1/(k·|y|)) Σ I[y_i ∈ S] · L_draft(y_i), focusing only on filtered tokens to maximize the draft's limited capacity and improve alignment on "easy" tokens.
- Outcome: an optimized draft model with a higher acceptance rate (up to 15% improvement).
- Key innovation, capacity-aware token selection: instead of learning all tokens uniformly, AdaSPEC identifies "hard" tokens that waste the draft model's limited capacity and focuses training on "learnable" tokens, maximizing alignment between draft and target models.
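The ΔL-based filtering and the selective loss can be sketched in a few lines. Reading "top k% by ΔL(w)" as largest ΔL, i.e. tokens where the draft lags a same-capacity reference and can therefore still improve, is this sketch's interpretation, and the helper names are hypothetical.

```python
import numpy as np

def select_learnable_tokens(loss_draft, loss_ref, keep_frac=0.5):
    """Phase 2 token filtering: delta(w) = L_draft(w) - L_ref(w).
    A large delta marks a token the draft lags on even though a
    same-capacity reference fits it, so it is learnable; a token that
    is hard for both models has a small delta and is filtered out."""
    delta = np.asarray(loss_draft, float) - np.asarray(loss_ref, float)
    k = max(1, int(round(len(delta) * keep_frac)))
    return set(np.argsort(-delta)[:k].tolist())   # indices of the filtered set S

def selective_distill_loss(loss_draft, selected):
    """Phase 3: average the draft's per-token KL only over the filtered
    set S, L = (1/|S|) * sum over i in S of L_draft(i)."""
    vals = [l for i, l in enumerate(loss_draft) if i in selected]
    return sum(vals) / len(vals)

# Token 1 is hard for draft AND reference (filtered out); token 2 is
# easy for the reference but not yet for the draft (kept).
l_draft = [0.2, 3.0, 2.5, 0.4]
l_ref   = [0.15, 2.9, 0.3, 0.1]
S = select_learnable_tokens(l_draft, l_ref, keep_frac=0.5)
print(sorted(S), round(selective_distill_loss(l_draft, S), 2))  # [2, 3] 1.45
```

By spending gradient only on tokens inside S, the draft's limited capacity goes to tokens where it can actually close the gap to the target, which is what raises the acceptance rate in speculative decoding.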
Q1. What is the main innovation of AdaSPEC compared to traditional knowledge distillation methods?
a) It uses a larger target model
b) It filters out hard-to-learn tokens and focuses on easier ones
c) It increases the training epochs

Q2. What metric does AdaSPEC aim to optimize in speculative decoding?
a) Model size reduction
b) Training speed
c) Token acceptance rate

Q3. In the experiments, what was the maximum improvement in acceptance rate achieved by AdaSPEC compared to DistillSpec?
a) Up to 15%
b) Up to 25%
c) Up to 5%