2025-10-24 Papers


Paper 1

Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Published: 2025-10-23

Link: http://arxiv.org/pdf/2510.20579

1. 📘 Topic and Domain: Video reasoning with explicit spatio-temporal evidence grounding, in the domain of multimodal AI and computer vision.
2. 💡 Previous Research and New Ideas: Based on OpenAI-o3's evidence-centered reasoning for images, proposes extending this to videos by integrating explicit spatio-temporal evidence into video reasoning.
3. ❓ Problem: Existing video reasoning models only generate textual reasoning without indicating when and where key evidence appears, making it difficult to verify their reasoning.
4. 🛠️ Methods: Uses a two-stage approach: (1) curates high-quality spatio-temporal training data, and (2) applies reinforcement learning with adaptive temporal proximity and temporal gating mechanisms to optimize temporal and spatial grounding.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on V-STAR benchmark, improving mAM by 14.4% and mLGM by 24.2% over baseline Qwen2.5-VL, with consistent gains on other video understanding benchmarks.

Figure: Open-o3 Video grounded video reasoning workflow
- Data construction: temporal-grounding sources, PLM-Rdcap sources, and 5.9k manual annotations yield STGR-CoT-30k and STGR-RL-36k; the annotation pipeline uses Gemini 2.5 Pro for initial annotation, followed by bounding-box filtering and self-consistency checking.
- Cold-start SFT: the Qwen2.5-VL-7B base model is fine-tuned on STGR-CoT-30k for 1 epoch at a learning rate of 1e-6.
- Reinforcement learning (GSPO) with three rewards: accuracy r_acc (MCQ/ROUGE/IoU), temporal r_t with adaptive proximity, and spatial r_s with temporal gating. Key innovations: adaptive temporal proximity (σ annealed from 4 to 1) and temporal gating (τ = 3 s threshold).
- Spatio-temporal output: reasoning traces interleave objects, boxes, and timestamps, e.g. <think> The video shows <obj>woman</obj> <box>[374,67,420,224]</box> at <t>9.2</t>s wearing a red vest... </think> <answer>Red vest over white shirt</answer>.
- Evaluation: V-STAR +14.4% mAM, VideoMME +1.2% overall, WorldSense +1.4% overall, TVGBench +4.5 mIoU.
- Test-time scaling: generate N=8 responses and apply confidence-aware voting with evidence, adding +1.0% on VideoMMMU and +1.2% on WorldSense.
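A minimal sketch of the two reward mechanisms named above, adaptive temporal proximity and temporal gating. The Gaussian form of the proximity reward is an assumption of this sketch; the σ annealing schedule (4 -> 1) and the τ = 3 s gate are taken from the paper's figure.

```python
import math

def temporal_reward(t_pred, t_gt, sigma):
    """Adaptive temporal proximity reward: Gaussian credit for a
    predicted timestamp near the ground truth. sigma is annealed during
    training (the figure shows 4 -> 1), so early training tolerates
    coarse timestamps and later training demands precise ones."""
    return math.exp(-((t_pred - t_gt) ** 2) / (2 * sigma ** 2))

def gated_spatial_reward(iou, t_pred, t_gt, tau=3.0):
    """Temporal gating: the spatial (box IoU) reward counts only when
    the predicted timestamp is within tau seconds of the ground truth,
    so a good box attached to the wrong moment earns nothing."""
    return iou if abs(t_pred - t_gt) <= tau else 0.0

# Early in training (sigma = 4) a 1.8 s timestamp error still earns
# most of the temporal reward, and the box lies within the 3 s gate.
r_t = temporal_reward(t_pred=11.0, t_gt=9.2, sigma=4.0)   # ~0.90
r_s = gated_spatial_reward(iou=0.7, t_pred=11.0, t_gt=9.2)  # 0.7
```

Annealing σ makes the temporal signal dense at the start and sharp at the end, while the gate keeps spatial credit from leaking onto mistimed evidence.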
Q1. What is the main innovation of Open-o3 Video compared to previous video reasoning models?
a) It uses more advanced language models for reasoning
b) It incorporates explicit spatio-temporal evidence in its reasoning process
c) It processes videos at a much faster speed

Q2. Which training strategy does Open-o3 Video use to improve its performance?
a) Single-stage end-to-end training
b) Unsupervised pre-training followed by fine-tuning
c) Two-stage approach with supervised fine-tuning and reinforcement learning

Q3. What was the improvement in mAM (mean Arithmetic Mean) achieved by Open-o3 Video over the Qwen2.5-VL baseline?
a) 5.2%
b) 14.4%
c) 24.2%

Paper 2

HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Published: 2025-10-23

Link: http://arxiv.org/pdf/2510.20822

1. 📘 Topic and Domain: Text-to-video generation focusing on creating coherent, multi-shot cinematic narratives using AI.
2. 💡 Previous Research and New Ideas: Based on previous single-shot video generation models and diffusion transformers, proposing a novel holistic approach that generates entire scenes in one pass rather than sequential or chunk-based generation.
3. ❓ Problem: Addresses the "narrative gap" between current AI's ability to generate isolated video clips and the ability to create coherent, multi-shot narratives that maintain consistency across scenes.
4. 🛠️ Methods: Developed HoloCine framework with two key mechanisms: Window Cross-Attention for precise directorial control and Sparse Inter-Shot Self-Attention for efficient computation across long videos.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance in narrative coherence, transition control, and character consistency, while demonstrating emergent capabilities like persistent memory for characters and intuitive grasp of cinematic techniques.
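Window Cross-Attention can be pictured as a cross-attention mask that routes each shot's caption only to that shot's frames. The concatenated layout below (an optional global caption followed by per-shot captions) and the helper name are assumptions of this sketch, not HoloCine's actual token layout.

```python
import numpy as np

def window_cross_attention_mask(shot_frame_counts, shot_text_lens, global_text_len=0):
    """Boolean cross-attention mask: the video tokens (queries) of shot i
    may read only shot i's caption tokens (keys), plus an optional global
    scene caption visible to every shot."""
    rows = sum(shot_frame_counts)                  # video tokens
    cols = global_text_len + sum(shot_text_lens)   # text tokens
    mask = np.zeros((rows, cols), dtype=bool)
    mask[:, :global_text_len] = True               # global caption: visible everywhere
    r, c = 0, global_text_len
    for f, t in zip(shot_frame_counts, shot_text_lens):
        mask[r:r + f, c:c + t] = True              # shot i frames <- shot i caption only
        r += f
        c += t
    return mask

# Two shots (2 and 3 frame tokens) with captions of 4 and 5 tokens,
# plus a 2-token global caption.
m = window_cross_attention_mask([2, 3], [4, 5], global_text_len=2)
```

Because each caption window touches only its own shot's frames, editing one shot's prompt changes that shot alone, which is the per-shot directorial control the summary describes.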

Figure: HoloCine holistic multi-shot generation pipeline
- Data curation: shot segmentation, multi-shot assembly, and hierarchical captioning; 400k samples captioned with Gemini 2.5.
- Architecture: Window Cross-Attention localizes text prompts to specific shots for precise directorial control; Sparse Inter-Shot Self-Attention is dense within shots and sparse between shots, reducing cost from O(N·L²) to O(N·L).
- Holistic generation: all shots are processed simultaneously in a single diffusion pass, built on a DiT-based video diffusion model (Wan2.2 14B) with joint self-attention for global consistency.
- Training setup: 10k steps on 128 NVIDIA H800 GPUs at 480×832 resolution, up to 13 shots per video, using FSDP and context parallelism.
- Evaluation: shot cut accuracy (SCA), inter-shot consistency, intra-shot consistency, semantic consistency, aesthetic quality, the VBench benchmark, and 100 diverse test prompts.
- Emergent capabilities: persistent memory, character consistency, long-range re-appearance, fine-grained detail persistence, and cinematic language control (shot scale, angle, movement).
- Comparisons: state-of-the-art results vs. Wan2.2 (pre-trained), StoryDiffusion+Wan2.2, IC-LoRA+Wan2.2, CineTrans, and commercial models (Vidu, Kling, Sora2).
- Ablations: without Window Cross-Attention, lower SCA and poor shot control; full vs. sparse attention, similar quality but much faster; without the summary component, loss of consistency and character drift. Stated limitation: causal reasoning, as the model prioritizes visual consistency over logical consequences.
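The dense-within, sparse-between attention pattern can be illustrated as a boolean self-attention mask. Treating the first token(s) of each shot as the only cross-shot links is this sketch's assumption about how the O(N·L²) -> O(N·L) sparsification might be realized; the paper's actual sparsity pattern may differ.

```python
import numpy as np

def sparse_inter_shot_mask(shot_lengths, summary_tokens=1):
    """Self-attention mask that is dense within each shot (every token
    sees its whole shot) and sparse across shots (only the first
    `summary_tokens` tokens of each shot are visible globally)."""
    total = sum(shot_lengths)
    mask = np.zeros((total, total), dtype=bool)
    starts = np.cumsum([0] + shot_lengths[:-1])
    for s, L in zip(starts, shot_lengths):
        mask[s:s + L, s:s + L] = True          # dense block within the shot
        mask[:, s:s + summary_tokens] = True   # cross-shot link via summary tokens
    return mask

# Three shots of 4 tokens each: full attention inside each shot,
# cross-shot attention only through each shot's leading summary token.
mask = sparse_inter_shot_mask([4, 4, 4])
```

With N shots of L tokens, the allowed entries grow as N·L² within shots plus O(N²·L) summary links, instead of the (N·L)² of full attention; per token the cost drops from N·L to roughly L, matching the figure's O(N·L²) -> O(N·L) claim.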
Q1. What is the main technical innovation that allows HoloCine to maintain efficiency when generating long videos?
a) Window Cross-Attention mechanism
b) Sparse Inter-Shot Self-Attention pattern
c) Hierarchical prompt structure

Q2. What unexpected capability emerged from HoloCine that wasn't explicitly designed for?
a) The ability to generate special effects
b) The ability to create original music scores
c) Persistent memory for characters and scenes across shots

Q3. What fundamental problem in AI video generation does HoloCine address?
a) The "narrative gap" between single clips and coherent multi-shot stories
b) The inability to generate high-resolution videos
c) The lack of realistic audio generation

Paper 3

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

Published: 2025-10-22

Link: http://arxiv.org/pdf/2510.19779

1. 📘 Topic and Domain: Efficient speculative decoding for large language models through selective knowledge distillation.
2. 💡 Previous Research and New Ideas: Based on conventional knowledge distillation and speculative decoding methods, proposes a novel selective token filtering approach for more efficient knowledge transfer.
3. ❓ Problem: Addresses the inefficiency in traditional knowledge distillation methods where draft models struggle to fully assimilate target model knowledge due to capacity constraints.
4. 🛠️ Methods: Introduces AdaSPEC, a two-phase approach: first distills the target model into a reference model to identify hard-to-fit tokens, then selectively distills the target into the draft model on the remaining, easier tokens.
5. 📊 Results and Evaluation: Consistently outperformed state-of-the-art DistillSpec method across diverse tasks (arithmetic, instruction-following, coding, summarization), achieving up to 15% higher acceptance rates with model configurations of 31M/1.4B and 350M/2.7B parameters.

Figure: AdaSPEC selective knowledge distillation workflow
- Phase 1, reference model construction: the fine-tuned target model Mp is distilled via KL divergence into a reference model Mref on dataset D.
- Phase 2, token filtering: compute per-token losses L_ref(w) = KL(P||R) and L_draft(w) = KL(P||Q), take ΔL(w) = L_draft(w) - L_ref(w), and select the top k% of tokens by ΔL(w) as the filtered token set S.
- Phase 3, selective draft-model distillation: train the initial draft model Mq with L_distill = (1/(k·|y|)) Σ I[y_i ∈ S] · L_draft(y_i), focusing only on filtered tokens to maximize the draft's limited capacity and improve alignment on "easy" tokens.
- Outcome: an optimized draft model with a higher acceptance rate (up to 15% improvement).
- Key innovation, capacity-aware token selection: instead of learning all tokens uniformly, AdaSPEC identifies "hard" tokens that waste the draft model's limited capacity and focuses training on "learnable" tokens, maximizing alignment between draft and target models.
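The ΔL-based filtering and the selective loss can be sketched in a few lines. Reading "top k% by ΔL(w)" as largest ΔL, i.e. tokens where the draft lags a same-capacity reference and can therefore still improve, is this sketch's interpretation, and the helper names are hypothetical.

```python
import numpy as np

def select_learnable_tokens(loss_draft, loss_ref, keep_frac=0.5):
    """Phase 2 token filtering: delta(w) = L_draft(w) - L_ref(w).
    A large delta marks a token the draft lags on even though a
    same-capacity reference fits it, so it is learnable; a token that
    is hard for both models has a small delta and is filtered out."""
    delta = np.asarray(loss_draft, float) - np.asarray(loss_ref, float)
    k = max(1, int(round(len(delta) * keep_frac)))
    return set(np.argsort(-delta)[:k].tolist())   # indices of the filtered set S

def selective_distill_loss(loss_draft, selected):
    """Phase 3: average the draft's per-token KL only over the filtered
    set S, L = (1/|S|) * sum over i in S of L_draft(i)."""
    vals = [l for i, l in enumerate(loss_draft) if i in selected]
    return sum(vals) / len(vals)

# Token 1 is hard for draft AND reference (filtered out); token 2 is
# easy for the reference but not yet for the draft (kept).
l_draft = [0.2, 3.0, 2.5, 0.4]
l_ref   = [0.15, 2.9, 0.3, 0.1]
S = select_learnable_tokens(l_draft, l_ref, keep_frac=0.5)
print(sorted(S), round(selective_distill_loss(l_draft, S), 2))  # [2, 3] 1.45
```

By spending gradient only on tokens inside S, the draft's limited capacity goes to tokens where it can actually close the gap to the target, which is what raises the acceptance rate in speculative decoding.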
Q1. What is the main innovation of AdaSPEC compared to traditional knowledge distillation methods?
a) It uses a larger target model
b) It filters out hard-to-learn tokens and focuses on easier ones
c) It increases the training epochs

Q2. What metric does AdaSPEC aim to optimize in speculative decoding?
a) Model size reduction
b) Training speed
c) Token acceptance rate

Q3. In the experiments, what was the maximum improvement in acceptance rate achieved by AdaSPEC compared to DistillSpec?
a) Up to 15%
b) Up to 25%
c) Up to 5%