1. 📘 Topic and Domain: The paper studies hallucination in large multimodal models (LMMs) for video understanding tasks, focusing on cases where models produce confident but factually incorrect responses about video content.
2. 💡 Previous Research and New Ideas: Previous research focused on hallucination in image and text modalities, while this paper introduces the first comprehensive benchmark for evaluating hallucinations in video understanding.
3. ❓ Problem: The paper addresses the lack of a systematic way to evaluate hallucinations in video understanding models and investigates training strategies to mitigate them.
4. 🛠️ Methods: The authors created the HAVEN benchmark with 6K questions spanning three dimensions (hallucination causes, hallucination aspects, and question formats), evaluated 16 LMMs on it, and developed a video-thinking model trained in two stages: supervised reasoning fine-tuning (SRFT) followed by thinking-based direct preference optimization (TDPO); a minimal sketch of the underlying preference loss appears after this list.
5. 📊 Results and Evaluation: The proposed thinking-based training strategy improved baseline accuracy by 7.65% on the hallucination evaluation and reduced the bias score by 4.5%; among the tested models, Valley-Eagle-7B and GPT-4o-mini performed best.
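
The summary does not reproduce the paper's training objectives, so as a rough illustration of the second training stage, here is a minimal PyTorch sketch of the standard direct preference optimization (DPO) loss that TDPO builds on. The function name, argument names, and the β value are hypothetical; per the paper, TDPO applies preference optimization to reasoning ("thinking") traces, which this sketch does not model.

```python
import torch
import torch.nn.functional as F


def dpo_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), frozen reference
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), frozen reference
    beta: float = 0.1,                    # temperature on the implicit reward (assumed value)
) -> torch.Tensor:
    # Implicit rewards: how far the policy's log-probability has moved
    # from the frozen reference model on each response.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin: push the preferred (chosen)
    # response's implicit reward above the dispreferred (rejected) one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In TDPO's setting, the chosen/rejected pairs would be built from model outputs with better versus worse reasoning traces on video questions, but the loss itself reduces to the margin-based form above.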