2025-03-26 Papers

Paper 1

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Published: 2025-03-24

Link: http://arxiv.org/pdf/2503.19325

1. 📘 Topic and Domain: Long-context autoregressive video modeling using next-frame prediction techniques in computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Builds on autoregressive techniques from language models and on video diffusion models; introduces the Frame AutoRegressive (FAR) model with FlexRoPE and long short-term context modeling.
3. ❓ Problem: The challenge of effectively utilizing extended temporal contexts in video generation while managing visual redundancy and computational costs.
4. 🛠️ Methods: Uses frame-wise flow matching with stochastic clean context training, FlexRoPE for temporal decay, and long short-term context modeling for efficient processing of long videos.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance in both short and long video generation, with 16× longer temporal extrapolation and better convergence than video diffusion transformers.
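FlexRoPE pairs rotary position embeddings over frame indices with a temporal decay that softly down-weights distant frames, which is what enables extrapolation beyond the training length. A minimal NumPy sketch of those two ingredients (function names and the `decay` constant are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles for integer frame positions."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (T, dim/2)

def apply_rope(x, positions):
    """Rotate feature pairs of x (T, dim) by position-dependent angles."""
    ang = rope_angles(positions, x.shape[1])
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def temporal_decay_bias(T, decay=0.05):
    """Additive attention bias that decays with temporal distance,
    so frames far beyond the training length are softly suppressed."""
    idx = np.arange(T)
    return -decay * np.abs(idx[:, None] - idx[None, :])  # (T, T)
```

Because RoPE is a pure rotation it preserves feature norms, and the decay enters only as an additive attention bias, so long-range context is attenuated rather than truncated.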
Q1
1. What is the main innovation introduced by FAR to handle the training-inference gap in observed context?
Using double training cost with clean copies
Stochastic clean context with unique timestep embedding
Increasing the size of context window
Q2
2. What is the maximum temporal extrapolation capability achieved by FAR with FlexRoPE compared to training length?
8x longer
12x longer
16x longer
Q3
3. What unique approach does FAR use to handle token redundancy in long videos?
Long short-term context modeling with different resolutions
Simple frame compression
Reducing frame rate

Paper 2

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

Published: 2025-03-25

Link: http://arxiv.org/pdf/2503.19622

1. 📘 Topic and Domain: The paper explores hallucination issues in large multimodal models (LMMs) specifically for video understanding tasks, focusing on cases where models provide incorrect responses despite appearing confident.
2. 💡 Previous Research and New Ideas: Previous research focused on hallucination in image and text modalities, while this paper introduces the first comprehensive benchmark for evaluating hallucinations in video understanding.
3. ❓ Problem: The paper aims to address the lack of systematic evaluation methods for hallucinations in video understanding models and proposes solutions to mitigate these hallucinations.
4. 🛠️ Methods: The authors created the HAVEN benchmark with 6K questions spanning three dimensions (hallucination causes, aspects, and question formats), evaluated 16 LMMs on it, and developed a video-thinking model using supervised reasoning fine-tuning (SRFT) and thinking-based direct preference optimization (TDPO).
5. 📊 Results and Evaluation: The proposed thinking-based training strategy improved baseline accuracy by 7.65% on hallucination evaluation and reduced the bias score by 4.5%; among the evaluated models, Valley-Eagle-7B and GPT-4o-mini performed best.
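TDPO builds on direct preference optimization, here applied to pairs of preferred vs. hallucinated reasoning traces. A minimal sketch of the generic DPO objective on a single preference pair (the pairing with thinking traces is the paper's contribution; this function is just the standard loss, with illustrative argument names):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on one preference pair: push the policy to widen
    the margin of the chosen (well-grounded) answer over the rejected
    (hallucinated) one, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

The loss is always positive and shrinks as the policy's margin over the reference grows; at zero margin it equals log 2.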
Q1
1. What are the three dimensions used in the HAVEN benchmark for evaluating hallucinations?
Model size, video duration, and frame count
Hallucination causes, hallucination aspects, and question formats
Visual quality, audio quality, and text coherence
Q2
2. Which training strategy was proposed to mitigate hallucinations in the video-thinking model?
Continuous pre-training with video data only
Multi-task learning with image and video inputs
Supervised reasoning fine-tuning (SRFT) combined with thinking-based direct preference optimization (TDPO)
Q3
3. What was the most significant improvement achieved by the proposed thinking-based training strategy?
A 7.65% increase in accuracy and 4.5% reduction in bias score
A 15% increase in video processing speed
A 20% reduction in model parameter count

Paper 3

Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

Published: 2025-03-25

Link: http://arxiv.org/pdf/2503.19385

1. 📘 Topic and Domain: Inference-time scaling for flow-based generative models in computer vision, specifically focusing on improving text-to-image generation quality without additional training.
2. 💡 Previous Research and New Ideas: Based on diffusion model inference-time scaling research; proposes new methods to enable particle sampling in flow models through stochastic generation and adaptive budget allocation.
3. ❓ Problem: Flow models lack stochasticity in their generative process, making it difficult to apply effective particle sampling methods that work well in diffusion models for improving generation quality.
4. 🛠️ Methods: Introduces three key components: SDE-based generation to enable particle sampling, Variance-Preserving interpolant conversion to increase sample diversity, and Rollover Budget Forcing for adaptive compute allocation.
5. 📊 Results and Evaluation: Achieved superior performance in compositional text-to-image generation and quantity-aware image generation tasks, outperforming previous methods while using fewer function evaluations, and demonstrated particularly strong results when combined with gradient-based methods for aesthetic image generation.
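Rollover Budget Forcing can be pictured as a per-step budget of function evaluations whose unused portion carries forward: once a stochastic step yields a sample beating the running best reward, the step stops early and the remainder rolls over to later steps. A toy sketch under that reading (both callbacks and all names are placeholders, not the paper's method or API):

```python
def rollover_particle_search(step_fn, reward_fn, n_steps=5, budget_per_step=4):
    """Toy adaptive budget allocation: each generation step may draw up to
    `budget` candidates via a stochastic (SDE-like) step; on the first
    candidate that beats the running best reward, the step ends early and
    the unused evaluations roll over to later steps."""
    state, best_reward, carry = 0.0, float("-inf"), 0
    for _ in range(n_steps):
        budget = budget_per_step + carry
        used, best_cand = 0, None
        while used < budget:
            cand = step_fn(state)           # one stochastic generation step
            used += 1
            r = reward_fn(cand)
            if best_cand is None or r > reward_fn(best_cand):
                best_cand = cand            # keep the best candidate so far
            if r > best_reward:             # early success: stop spending here
                best_reward = r
                break
        carry = budget - used               # unused evaluations roll over
        state = best_cand
    return state, best_reward
```

With an easy reward every step terminates after one evaluation and the carry grows; with a hard reward the accumulated budget is spent where the search actually stalls, which is the adaptive-allocation intuition.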
Q1
1. What is the main challenge that prevents flow models from using particle sampling methods effectively?
Flow models are too slow at generating images
Flow models lack stochasticity in their generative process
Flow models require too much training data
Q2
2. Which component in the paper's method is responsible for increasing sample diversity during generation?
Rollover Budget Forcing
SDE-based generation
Variance-Preserving interpolant conversion
Q3
3. What advantage do flow models maintain over diffusion models even after adding stochasticity?
They produce clearer expected outputs at intermediate steps
They require less memory during inference
They can be trained faster