2026-02-02 Papers

Paper 1

Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.20833

1. 📘 Topic and Domain: The paper focuses on autonomous scientific discovery using large language models (LLMs) in the domain of AI-assisted research automation.
2. 💡 Previous Research and New Ideas: Building on runtime-centric LLM research agents that repeatedly process literature online, the paper proposes Idea2Story, a pre-computation-driven framework that shifts literature understanding to offline knowledge graph construction.
3. ❓ Problem: The paper addresses the computational inefficiency and unreliability of existing LLM-based research agents that repeatedly read and reason over large volumes of scientific literature at runtime.
4. 🛠️ Methods: The authors adopt a two-stage approach: offline construction of a methodological knowledge graph from peer-reviewed papers and reviews, followed by online research generation that retrieves and composes reusable method units from the graph (see the retrieval sketch after this list).
5. 📊 Results and Evaluation: Qualitative analyses and case studies demonstrate that Idea2Story generates more coherent, methodologically grounded, and novel research patterns compared to direct LLM generation, with external evaluation consistently favoring Idea2Story's outputs.
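
To make the online retrieval step concrete, here is a minimal sketch of multi-view scoring over method units. It is an illustration only: cosine similarity over precomputed embeddings, the view weights, and the dictionary layout of a method unit are all assumptions, not details taken from the paper.

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two embedding vectors.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def score_method_unit(idea_vec, unit, weights=(0.5, 0.3, 0.2)):
        # Combine the three retrieval views named in the paper's figure:
        # idea-level similarity, domain-level relevance, paper-level alignment.
        # `unit` is assumed to carry one precomputed embedding per view.
        w_idea, w_domain, w_paper = weights
        return (w_idea * cosine(idea_vec, unit["idea_emb"])
                + w_domain * cosine(idea_vec, unit["domain_emb"])
                + w_paper * cosine(idea_vec, unit["paper_emb"]))

    def retrieve(idea_vec, units, k=5):
        # Rank all method units in the graph and keep the top-k for composition.
        return sorted(units, key=lambda u: score_method_unit(idea_vec, u), reverse=True)[:k]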

Figure: Idea2Story pipeline (research concept to scientific narrative)

• Offline knowledge construction: a paper pool of ~13K papers from NeurIPS & ICLR; method unit extraction of core methodological contributions, with anonymization (A), safety filtering (F), and review artifacts; canonicalization via UMAP reduction and DBSCAN clustering into meta-methods; the result is a knowledge graph G=(V,E) of structured method units and composition relations.
• Online research generation: a user research idea drives pattern retrieval with multi-view scoring (idea-level similarity, domain-level relevance, paper-level alignment), followed by review-guided refinement (an LLM-based generate-review-revise loop with novelty & soundness checks and a rollback mechanism). The output is a structured research pattern: coherent, methodologically grounded research directions ready for experimentation and paper generation.
• Key innovation: a pre-computation-driven framework (data collection → extraction → knowledge structure → retrieval & composition) that shifts literature understanding from online reasoning to offline knowledge construction.
• Benefits: reduced computational cost, alleviated context-window bottleneck, higher-quality research patterns, reusable methodological abstractions, grounding in peer-reviewed work, end-to-end paper generation.
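
The canonicalization step above (UMAP reduction followed by DBSCAN clustering of method-unit embeddings) can be sketched generically with umap-learn and scikit-learn; the parameter values are illustrative, not the paper's configuration.

    import numpy as np
    import umap  # pip install umap-learn
    from sklearn.cluster import DBSCAN

    def canonicalize_method_units(embeddings: np.ndarray) -> np.ndarray:
        # Reduce high-dimensional method-unit embeddings with UMAP, then
        # group near-duplicate units into meta-methods with DBSCAN.
        # Returns one cluster label per unit; label -1 marks outliers.
        reduced = umap.UMAP(n_components=10, metric="cosine").fit_transform(embeddings)
        return DBSCAN(eps=0.5, min_samples=3).fit_predict(reduced)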
Q1
1. What is the fundamental paradigm shift that Idea2Story introduces compared to existing LLM-based research agents?
Moving from offline to online literature processing for faster runtime execution
Shifting from runtime-centric execution to pre-computation-driven knowledge graph construction
Replacing knowledge graphs with direct neural embeddings of research papers
Q2
2. In the offline knowledge construction phase, what specific artifacts does Idea2Story extract from the approximately 13,000 papers in its corpus?
Author names, affiliations, and citation networks for collaboration analysis
Raw experimental results and dataset specifications for reproducibility
Method units capturing core methodological contributions and their composition relations
Q3
3. When comparing Idea2Story's output to direct LLM generation for the e-commerce intent understanding task, what was the key difference in problem formulation?
Idea2Story reframed it as a dynamic structural reasoning process, while the LLM kept it as static classification
Both systems proposed identical multi-class classification approaches with different architectures
The LLM introduced diffusion-based methods, while Idea2Story used traditional BERT encoders

Paper 2

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.20354

1. 📘 Topic and Domain: The paper focuses on benchmarking spatial intelligence of text-to-image models, evaluating their ability to understand and generate complex spatial relationships in images.
2. 💡 Previous Research and New Ideas: Building on existing T2I benchmarks that use short or information-sparse prompts, this work proposes long, information-dense prompts covering 10 spatial sub-domains and introduces omni-dimensional multi-choice evaluations instead of simple yes/no questions.
3. ❓ Problem: Current T2I models excel at generating objects but struggle with complex spatial relationships such as positioning, orientation, occlusion, and causal interactions, which existing benchmarks do not adequately evaluate.
4. 🛠️ Methods: The authors create SpatialGenEval with 1,230 information-dense prompts across 25 real-world scenes, each paired with 10 multi-choice questions targeting different spatial abilities (a sketch of one QA item follows this list), and construct the SpatialT2I dataset with 15,400 text-image pairs for fine-tuning.
5. 📊 Results and Evaluation: Evaluation of 23 SOTA models reveals spatial reasoning as the primary bottleneck (scores often below 30%), while fine-tuning with SpatialT2I yields consistent improvements (+4.2% for SD-XL, +5.7% for UniWorld-V1, +4.4% for OmniGen2).
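
As a concrete illustration of the omni-dimensional multi-choice format, here is a minimal sketch of one QA item. The field names are hypothetical; the five-option layout with an 'E: None' escape choice follows the benchmark's description (see Q3 below).

    from dataclasses import dataclass, field

    @dataclass
    class SpatialQA:
        # One of the 10 multi-choice questions attached to a prompt.
        sub_domain: str  # e.g. "occlusion" or "orientation"
        question: str
        choices: dict = field(default_factory=dict)

    item = SpatialQA(
        sub_domain="occlusion",
        question="Which object partially hides the red mug?",
        choices={
            "A": "The laptop", "B": "The plant",
            "C": "The lamp", "D": "The stack of books",
            "E": "None",  # escape option when the image matches no choice
        },
    )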

Figure: SpatialGenEval benchmarking workflow

• Benchmark construction: domain & scene selection (25 scenes, 10 domains) → prompt generation with Gemini 2.5 Pro (long & dense) → human-in-the-loop refinement (1,230 prompts) → omni-dimensional multi-choice QA generation (10 per prompt) → human QA validation (12,300 QAs).
• Evaluation: T2I generation with 23 SOTA models → MLLM evaluation with Qwen2.5-VL-72B using 5-round voting → results analysis with performance scores by sub-domain.
• Fine-tuning: SpatialT2I dataset construction (15,400 pairs) → model fine-tuning (SD-XL, UniWorld, OmniGen2).
• Key features: long & dense prompts, 10 spatial sub-domains, multi-choice questions, human validation.
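
The 5-round voting step in the workflow can be approximated as a simple majority vote over repeated judge calls. Here ask_judge is a hypothetical wrapper around the Qwen2.5-VL-72B evaluator; the real prompt format and aggregation may differ.

    from collections import Counter

    def evaluate_image(image, qa, ask_judge, rounds=5):
        # Query the MLLM judge `rounds` times and take the majority answer.
        # `ask_judge(image, question, choices)` is assumed to return one
        # option letter, "A" through "E".
        votes = [ask_judge(image, qa.question, qa.choices) for _ in range(rounds)]
        answer, _count = Counter(votes).most_common(1)[0]
        return answer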
Q1
1. What is the primary bottleneck identified in current text-to-image models according to SpatialGenEval?
Object generation and attribute binding
Spatial reasoning capabilities like comparison and occlusion
Text encoding and prompt understanding
Q2
2. How does SpatialGenEval differ from existing T2I benchmarks in its evaluation approach?
It uses 10 multi-choice questions per prompt covering all spatial sub-domains instead of simple yes/no questions
It focuses exclusively on artistic style and aesthetic quality assessment
It only evaluates closed-source commercial models
Q3
3. What unique feature does SpatialGenEval include to prevent forced guessing in evaluations?
It uses human annotators exclusively instead of automated evaluation
It includes an 'E: None' option when generated images don't match any given choices
It requires models to generate multiple images for each prompt

Paper 3

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Published: 2026-01-29

Link: http://arxiv.org/pdf/2601.21821

1. 📘 Topic and Domain: The paper focuses on multimodal reasoning in vision-language models, specifically addressing the performance gap between open-source and proprietary systems through data-centric methods.
2. 💡 Previous Research and New Ideas: Building on existing multimodal datasets like FineVision and LLaVA-OneVision, the paper proposes MMFineReason, a large-scale dataset with high-quality Chain-of-Thought annotations distilled from Qwen3-VL-235B-A22B-Thinking, which addresses the data imbalance and inconsistent reasoning quality of existing corpora.
3. ❓ Problem: The paper aims to solve the lack of high-quality reasoning data in open-source multimodal models, particularly the scarcity of STEM diagram and visual puzzle samples with consistent, long-form reasoning annotations.
4. 🛠️ Methods: The authors use a three-stage pipeline: data collection and standardization from diverse sources, CoT rationale generation via teacher-model distillation, and difficulty-aware filtering for quality verification and efficient subset creation (a filtering sketch follows this list).
5. 📊 Results and Evaluation: MMFineReason models (2B/4B/8B) achieve state-of-the-art results for their size class, with the 4B model surpassing Qwen3-VL-8B-Thinking and the 8B model outperforming Qwen3-VL-30B-A3B-Thinking; training on only 7% of the data achieves performance comparable to the full dataset.
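
A minimal sketch of what difficulty-aware filtering could look like, assuming each sample carries a pass rate from repeated probe-model attempts; the thresholds and field names are illustrative, not from the paper.

    def difficulty_filter(samples, low=0.1, high=0.9):
        # Keep samples in a useful difficulty band: a pass rate near 1.0
        # means the sample is trivial, near 0.0 likely noisy or unsolvable.
        return [s for s in samples if low <= s["pass_rate"] <= high]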

Figure: MMFineReason data pipeline flow

• Stage 1 (data collection & processing): sources include FineVision, BMMR, Euclid30K, GameQA-140K, and others, followed by data cleaning & standardization.
• Stage 2 (reasoning distillation): Qwen3-VL-235B-A22B-Thinking acts as the teacher for long CoT generation, dense caption generation, and template validation.
• Stage 3 (data selection): quality filtering (length & template), correctness verification, difficulty-aware filtering, and n-gram de-duplication.
• Output datasets: MMFineReason-1.8M, MMFineReason-586K, and MMFineReason-123K, feeding SFT training and RL training (GSPO).
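
The n-gram de-duplication stage can be sketched with Jaccard similarity over word 3-gram sets, one standard approach; the paper's exact procedure is not specified in this summary.

    def ngrams(text: str, n: int = 3) -> set:
        # Set of word n-grams for one sample's text.
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
        # Flag a pair as near-duplicates when the Jaccard overlap of
        # their n-gram sets exceeds the threshold.
        ga, gb = ngrams(a), ngrams(b)
        if not ga or not gb:
            return False
        return len(ga & gb) / len(ga | gb) >= threshold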
Q1
1. What surprising finding did the authors discover about data efficiency in MMFineReason?
Training on just 7% of the data (123K samples) achieves performance comparable to the full 1.8M dataset
Larger models always require exponentially more data to achieve better performance
Natural images require 10x more samples than STEM diagrams for effective training
Q2
2. Which teacher model did MMFineReason use to distill high-quality reasoning annotations, and what made this choice significant?
GPT-4V, because it's the most widely used proprietary model for data annotation
Qwen3-VL-235B-A22B-Thinking, making the pipeline fully open-source without relying on closed APIs
A custom ensemble of multiple smaller models to reduce computational costs
Q3
3. What counterintuitive result did the ablation studies reveal about ultra-high resolution inputs for reasoning tasks?
2048×2048 resolution dramatically improves performance on all benchmarks by 20%
Resolution has no impact on reasoning tasks whatsoever
Ultra-high resolution (2048²) shows diminishing returns compared to 768² for reasoning, though it still helps with natural images