2025-03-26 Papers

Paper 1

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Published: 2025-03-24

Link: http://arxiv.org/pdf/2503.19325

1. 📘 Topic and Domain: Long-context autoregressive video modeling using next-frame prediction techniques in computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Builds on autoregressive techniques from language models and on video diffusion models; introduces the Frame AutoRegressive (FAR) model with FlexRoPE and long short-term context modeling.
3. ❓ Problem: The challenge of effectively utilizing extended temporal contexts in video generation while managing visual redundancy and computational costs.
4. 🛠️ Methods: Uses frame-wise flow matching with stochastic clean context training, FlexRoPE for temporal decay, and long short-term context modeling for efficient processing of long videos.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance in both short and long video generation, with 16× longer temporal extrapolation and better convergence than video diffusion transformers.
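FlexRoPE pairs rotary position embeddings over frame indices with a temporal decay that softly down-weights distant frames, which is what enables extrapolation beyond the training length. A minimal NumPy sketch of those two ingredients (function names and the `decay` constant are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles for integer frame positions."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (T, dim/2)

def apply_rope(x, positions):
    """Rotate feature pairs of x (T, dim) by position-dependent angles."""
    ang = rope_angles(positions, x.shape[1])
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def temporal_decay_bias(T, decay=0.05):
    """Additive attention bias that decays with temporal distance,
    so frames far beyond the training length are softly suppressed."""
    idx = np.arange(T)
    return -decay * np.abs(idx[:, None] - idx[None, :])  # (T, T)
```

Because RoPE is a pure rotation it preserves feature norms, and the decay enters only as an additive attention bias, so long-range context is attenuated rather than truncated.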
Q1
1. What is the main innovation introduced by FAR to handle the training-inference gap in observed context?
Using double training cost with clean copies
Stochastic clean context with unique timestep embedding
Increasing the size of context window
Q2
2. What is the maximum temporal extrapolation capability achieved by FAR with FlexRoPE compared to training length?
8x longer
12x longer
16x longer
Q3
3. What unique approach does FAR use to handle token redundancy in long videos?
Long short-term context modeling with different resolutions
Simple frame compression
Reducing frame rate

Paper 2

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

Published: 2025-03-25

Link: http://arxiv.org/pdf/2503.19622

1. 📘 Topic and Domain: The paper explores hallucination issues in large multimodal models (LMMs) specifically for video understanding tasks, focusing on cases where models provide incorrect responses despite appearing confident.
2. 💡 Previous Research and New Ideas: Previous research focused on hallucination in image and text modalities, while this paper introduces the first comprehensive benchmark for evaluating hallucinations in video understanding.
3. ❓ Problem: The paper aims to address the lack of systematic evaluation methods for hallucinations in video understanding models and proposes solutions to mitigate these hallucinations.
4. 🛠️ Methods: The authors created the HAVEN benchmark with 6K questions spanning three dimensions (hallucination causes, aspects, and question formats), evaluated 16 LMMs on it, and developed a video-thinking model using supervised reasoning fine-tuning (SRFT) and thinking-based direct preference optimization (TDPO).
5. 📊 Results and Evaluation: The proposed thinking-based training strategy improved baseline accuracy by 7.65% on hallucination evaluation and reduced the bias score by 4.5%; among the evaluated models, Valley-Eagle-7B and GPT-4o-mini performed best.
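TDPO builds on direct preference optimization, here applied to pairs of preferred vs. hallucinated reasoning traces. A minimal sketch of the generic DPO objective on a single preference pair (the pairing with thinking traces is the paper's contribution; this function is just the standard loss, with illustrative argument names):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on one preference pair: push the policy to widen
    the margin of the chosen (well-grounded) answer over the rejected
    (hallucinated) one, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

The loss is always positive and shrinks as the policy's margin over the reference grows; at zero margin it equals log 2.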
Q1
1. What are the three dimensions used in the HAVEN benchmark for evaluating hallucinations?
Model size, video duration, and frame count
Hallucination causes, hallucination aspects, and question formats
Visual quality, audio quality, and text coherence
Q2
2. Which training strategy was proposed to mitigate hallucinations in the video-thinking model?
Continuous pre-training with video data only
Multi-task learning with image and video inputs
Supervised reasoning fine-tuning (SRFT) combined with thinking-based direct preference optimization (TDPO)
Q3
3. What was the most significant improvement achieved by the proposed thinking-based training strategy?
A 7.65% increase in accuracy and 4.5% reduction in bias score
A 15% increase in video processing speed
A 20% reduction in model parameter count

Paper 3

Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

Published: 2025-03-25

Link: http://arxiv.org/pdf/2503.19385

1. 📘 Topic and Domain: Inference-time scaling for flow-based generative models in computer vision, specifically focusing on improving text-to-image generation quality without additional training.
2. 💡 Previous Research and New Ideas: Based on diffusion model inference-time scaling research; proposes new methods to enable particle sampling in flow models through stochastic generation and adaptive budget allocation.
3. ❓ Problem: Flow models lack stochasticity in their generative process, making it difficult to apply effective particle sampling methods that work well in diffusion models for improving generation quality.
4. 🛠️ Methods: Introduces three key components: SDE-based generation to enable particle sampling, Variance-Preserving interpolant conversion to increase sample diversity, and Rollover Budget Forcing for adaptive compute allocation.
5. 📊 Results and Evaluation: Achieved superior performance in compositional text-to-image generation and quantity-aware image generation tasks, outperforming previous methods while using fewer function evaluations, and demonstrated particularly strong results when combined with gradient-based methods for aesthetic image generation.
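Rollover Budget Forcing can be pictured as a per-step budget of function evaluations whose unused portion carries forward: once a stochastic step yields a sample beating the running best reward, the step stops early and the remainder rolls over to later steps. A toy sketch under that reading (both callbacks and all names are placeholders, not the paper's method or API):

```python
def rollover_particle_search(step_fn, reward_fn, n_steps=5, budget_per_step=4):
    """Toy adaptive budget allocation: each generation step may draw up to
    `budget` candidates via a stochastic (SDE-like) step; on the first
    candidate that beats the running best reward, the step ends early and
    the unused evaluations roll over to later steps."""
    state, best_reward, carry = 0.0, float("-inf"), 0
    for _ in range(n_steps):
        budget = budget_per_step + carry
        used, best_cand = 0, None
        while used < budget:
            cand = step_fn(state)           # one stochastic generation step
            used += 1
            r = reward_fn(cand)
            if best_cand is None or r > reward_fn(best_cand):
                best_cand = cand            # keep the best candidate so far
            if r > best_reward:             # early success: stop spending here
                best_reward = r
                break
        carry = budget - used               # unused evaluations roll over
        state = best_cand
    return state, best_reward
```

With an easy reward every step terminates after one evaluation and the carry grows; with a hard reward the accumulated budget is spent where the search actually stalls, which is the adaptive-allocation intuition.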
Q1
1. What is the main challenge that prevents flow models from using particle sampling methods effectively?
Flow models are too slow at generating images
Flow models lack stochasticity in their generative process
Flow models require too much training data
Q2
2. Which component in the paper's method is responsible for increasing sample diversity during generation?
Rollover Budget Forcing
SDE-based generation
Variance-Preserving interpolant conversion
Q3
3. What advantage do flow models maintain over diffusion models even after adding stochasticity?
They produce clearer expected outputs at intermediate steps
They require less memory during inference
They can be trained faster