1. 📘 Topic and Domain: The paper focuses on enhancing the spatial intelligence of Multimodal Large Language Models (MLLMs), i.e., their ability to understand and reason about 3D scenes from 2D video inputs.
2. 💡 Previous Research and New Ideas: Previous research relied on additional 3D/2.5D data (e.g., point clouds or depth maps) for spatial understanding; this paper instead uses only 2D video inputs, combining semantic and structural features through a dual-encoder architecture whose spatial branch is initialized from a visual geometry foundation model.
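The dual-encoder idea can be sketched as below. This is a minimal illustration with random weights; the encoder internals, feature dimensions, and the concatenate-then-project connector are assumptions for illustration, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not from the paper).
D_SEM, D_SPA, D_LLM = 768, 512, 1024
N_FRAMES, N_TOKENS = 8, 196  # video frames, patch tokens per frame

def semantic_encoder(n_frames):
    """Stand-in for a 2D semantic encoder (e.g., a ViT-style backbone)."""
    return rng.standard_normal((n_frames, N_TOKENS, D_SEM))

def spatial_encoder(n_frames):
    """Stand-in for a spatial encoder initialized from a visual
    geometry foundation model, producing structure-aware features."""
    return rng.standard_normal((n_frames, N_TOKENS, D_SPA))

# Connector: fuse the two streams and project into the LLM token space.
W_fuse = rng.standard_normal((D_SEM + D_SPA, D_LLM)) * 0.02

def connector(sem, spa):
    fused = np.concatenate([sem, spa], axis=-1)  # (F, T, D_SEM + D_SPA)
    return fused @ W_fuse                        # (F, T, D_LLM)

sem = semantic_encoder(N_FRAMES)
spa = spatial_encoder(N_FRAMES)
visual_tokens = connector(sem, spa)
print(visual_tokens.shape)  # → (8, 196, 1024)
```

The key design point is that both streams see the same 2D frames: structural cues come from the geometry-pretrained encoder rather than from extra 3D inputs.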
3. ❓ Problem: The paper addresses MLLMs' limited ability to understand and reason about 3D spatial relationships when only given 2D video inputs, without access to additional 3D data like point clouds or depth maps.
4. 🛠️ Methods: The paper implements a dual-encoder architecture (2D semantic encoder + spatial encoder), a connector module for feature fusion, and a space-aware frame sampling strategy, trained on their Spatial-MLLM-120k dataset using supervised fine-tuning followed by GRPO (Group Relative Policy Optimization).
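A space-aware frame sampling strategy that favors scene coverage can be sketched as a greedy maximum-coverage selection. The voxel-visibility representation and the greedy criterion below are illustrative assumptions, not the paper's exact algorithm:

```python
def sample_frames(frame_voxels, k):
    """Greedily pick k frames that maximize coverage of the 3D scene.

    frame_voxels: list where frame_voxels[i] is the set of coarse voxel
    ids visible in frame i (e.g., derived from predicted geometry).
    Assumed representation; the paper's criterion may differ.
    """
    covered, chosen = set(), []
    for _ in range(k):
        # Pick the frame that adds the most not-yet-covered voxels.
        best = max(
            (i for i in range(len(frame_voxels)) if i not in chosen),
            key=lambda i: len(frame_voxels[i] - covered),
        )
        chosen.append(best)
        covered |= frame_voxels[best]
    return chosen

# Toy example: four frames observing overlapping voxel sets.
views = [{1, 2, 3}, {3, 4}, {5, 6, 7, 8}, {1, 8}]
print(sample_frames(views, 2))  # → [2, 0]
```

Compared with uniform temporal sampling, this kind of selection spends the frame budget on views that reveal new parts of the scene rather than redundant ones.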
5. 📊 Results and Evaluation: The model achieves state-of-the-art performance on multiple benchmarks, including VSI-Bench, ScanQA, and SQA3D, outperforming both proprietary and open-source models despite having far fewer parameters (4B vs. up to 72B for the baselines).