2026-02-02 Papers

Paper 1

Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.20833

1. 📘 Topic and Domain: The paper focuses on autonomous scientific discovery using large language models (LLMs) in the domain of AI-assisted research automation.
2. 💡 Previous Research and New Ideas: Building on runtime-centric LLM research agents that repeatedly process literature online, the paper proposes Idea2Story, a pre-computation-driven framework that shifts literature understanding to offline knowledge graph construction.
3. ❓ Problem: The paper addresses the computational inefficiency and unreliability of existing LLM-based research agents that repeatedly read and reason over large volumes of scientific literature at runtime.
4. 🛠️ Methods: The authors adopt a two-stage approach: offline construction of a methodological knowledge graph from peer-reviewed papers and reviews, followed by online research generation that retrieves and composes reusable method units from the graph (see the retrieval sketch after this list).
5. 📊 Results and Evaluation: Qualitative analyses and case studies demonstrate that Idea2Story generates more coherent, methodologically grounded, and novel research patterns compared to direct LLM generation, with external evaluation consistently favoring Idea2Story's outputs.
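
To make the online retrieval step concrete, here is a minimal sketch of multi-view scoring over method units. It is an illustration only: cosine similarity over precomputed embeddings, the view weights, and the dictionary layout of a method unit are all assumptions, not details taken from the paper.

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two embedding vectors.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def score_method_unit(idea_vec, unit, weights=(0.5, 0.3, 0.2)):
        # Combine the three retrieval views named in the paper's figure:
        # idea-level similarity, domain-level relevance, paper-level alignment.
        # `unit` is assumed to carry one precomputed embedding per view.
        w_idea, w_domain, w_paper = weights
        return (w_idea * cosine(idea_vec, unit["idea_emb"])
                + w_domain * cosine(idea_vec, unit["domain_emb"])
                + w_paper * cosine(idea_vec, unit["paper_emb"]))

    def retrieve(idea_vec, units, k=5):
        # Rank all method units in the graph and keep the top-k for composition.
        return sorted(units, key=lambda u: score_method_unit(idea_vec, u), reverse=True)[:k]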

Figure: Idea2Story pipeline (research concept to scientific narrative)

• Offline knowledge construction: a paper pool of ~13K papers from NeurIPS & ICLR; method unit extraction of core methodological contributions, with anonymization (A), safety filtering (F), and review artifacts; canonicalization via UMAP reduction and DBSCAN clustering into meta-methods; the result is a knowledge graph G=(V,E) of structured method units and composition relations.
• Online research generation: a user research idea drives pattern retrieval with multi-view scoring (idea-level similarity, domain-level relevance, paper-level alignment), followed by review-guided refinement (an LLM-based generate-review-revise loop with novelty & soundness checks and a rollback mechanism). The output is a structured research pattern: coherent, methodologically grounded research directions ready for experimentation and paper generation.
• Key innovation: a pre-computation-driven framework (data collection → extraction → knowledge structure → retrieval & composition) that shifts literature understanding from online reasoning to offline knowledge construction.
• Benefits: reduced computational cost, alleviated context-window bottleneck, higher-quality research patterns, reusable methodological abstractions, grounding in peer-reviewed work, end-to-end paper generation.
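
The canonicalization step above (UMAP reduction followed by DBSCAN clustering of method-unit embeddings) can be sketched generically with umap-learn and scikit-learn; the parameter values are illustrative, not the paper's configuration.

    import numpy as np
    import umap  # pip install umap-learn
    from sklearn.cluster import DBSCAN

    def canonicalize_method_units(embeddings: np.ndarray) -> np.ndarray:
        # Reduce high-dimensional method-unit embeddings with UMAP, then
        # group near-duplicate units into meta-methods with DBSCAN.
        # Returns one cluster label per unit; label -1 marks outliers.
        reduced = umap.UMAP(n_components=10, metric="cosine").fit_transform(embeddings)
        return DBSCAN(eps=0.5, min_samples=3).fit_predict(reduced)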
Q1
1. What is the fundamental paradigm shift that Idea2Story introduces compared to existing LLM-based research agents?
Moving from offline to online literature processing for faster runtime execution
Shifting from runtime-centric execution to pre-computation-driven knowledge graph construction
Replacing knowledge graphs with direct neural embeddings of research papers
Q2
2. In the offline knowledge construction phase, what specific artifacts does Idea2Story extract from the approximately 13,000 papers in its corpus?
Author names, affiliations, and citation networks for collaboration analysis
Raw experimental results and dataset specifications for reproducibility
Method units capturing core methodological contributions and their composition relations
Q3
3. When comparing Idea2Story's output to direct LLM generation for the e-commerce intent understanding task, what was the key difference in problem formulation?
Idea2Story reframed it as a dynamic structural reasoning process, while the LLM kept it as static classification
Both systems proposed identical multi-class classification approaches with different architectures
The LLM introduced diffusion-based methods, while Idea2Story used traditional BERT encoders

Paper 2

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.20354

1. 📘 Topic and Domain: The paper focuses on benchmarking spatial intelligence of text-to-image models, evaluating their ability to understand and generate complex spatial relationships in images.
2. 💡 Previous Research and New Ideas: Building on existing T2I benchmarks that use short or information-sparse prompts, this work proposes long, information-dense prompts covering 10 spatial sub-domains and introduces omni-dimensional multi-choice evaluations instead of simple yes/no questions.
3. ❓ Problem: Current T2I models excel at generating objects but struggle with complex spatial relationships such as positioning, orientation, occlusion, and causal interactions, which existing benchmarks do not adequately evaluate.
4. 🛠️ Methods: The authors create SpatialGenEval with 1,230 information-dense prompts across 25 real-world scenes, each paired with 10 multi-choice questions targeting different spatial abilities (a sketch of one QA item follows this list), and construct the SpatialT2I dataset with 15,400 text-image pairs for fine-tuning.
5. 📊 Results and Evaluation: Evaluation of 23 SOTA models reveals spatial reasoning as the primary bottleneck (scores often below 30%), while fine-tuning with SpatialT2I yields consistent improvements (+4.2% for SD-XL, +5.7% for UniWorld-V1, +4.4% for OmniGen2).
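
As a concrete illustration of the omni-dimensional multi-choice format, here is a minimal sketch of one QA item. The field names are hypothetical; the five-option layout with an 'E: None' escape choice follows the benchmark's description (see Q3 below).

    from dataclasses import dataclass, field

    @dataclass
    class SpatialQA:
        # One of the 10 multi-choice questions attached to a prompt.
        sub_domain: str  # e.g. "occlusion" or "orientation"
        question: str
        choices: dict = field(default_factory=dict)

    item = SpatialQA(
        sub_domain="occlusion",
        question="Which object partially hides the red mug?",
        choices={
            "A": "The laptop", "B": "The plant",
            "C": "The lamp", "D": "The stack of books",
            "E": "None",  # escape option when the image matches no choice
        },
    )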

Figure: SpatialGenEval benchmarking workflow

• Benchmark construction: domain & scene selection (25 scenes, 10 domains) → prompt generation with Gemini 2.5 Pro (long & dense) → human-in-the-loop refinement (1,230 prompts) → omni-dimensional multi-choice QA generation (10 per prompt) → human QA validation (12,300 QAs).
• Evaluation: T2I generation with 23 SOTA models → MLLM evaluation with Qwen2.5-VL-72B using 5-round voting → results analysis with performance scores by sub-domain.
• Fine-tuning: SpatialT2I dataset construction (15,400 pairs) → model fine-tuning (SD-XL, UniWorld, OmniGen2).
• Key features: long & dense prompts, 10 spatial sub-domains, multi-choice questions, human validation.
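
The 5-round voting step in the workflow can be approximated as a simple majority vote over repeated judge calls. Here ask_judge is a hypothetical wrapper around the Qwen2.5-VL-72B evaluator; the real prompt format and aggregation may differ.

    from collections import Counter

    def evaluate_image(image, qa, ask_judge, rounds=5):
        # Query the MLLM judge `rounds` times and take the majority answer.
        # `ask_judge(image, question, choices)` is assumed to return one
        # option letter, "A" through "E".
        votes = [ask_judge(image, qa.question, qa.choices) for _ in range(rounds)]
        answer, _count = Counter(votes).most_common(1)[0]
        return answer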
Q1
1. What is the primary bottleneck identified in current text-to-image models according to SpatialGenEval?
Object generation and attribute binding
Spatial reasoning capabilities like comparison and occlusion
Text encoding and prompt understanding
Q2
2. How does SpatialGenEval differ from existing T2I benchmarks in its evaluation approach?
It uses 10 multi-choice questions per prompt covering all spatial sub-domains instead of simple yes/no questions
It focuses exclusively on artistic style and aesthetic quality assessment
It only evaluates closed-source commercial models
Q3
3. What unique feature does SpatialGenEval include to prevent forced guessing in evaluations?
It uses human annotators exclusively instead of automated evaluation
It includes an 'E: None' option when generated images don't match any given choices
It requires models to generate multiple images for each prompt

Paper 3

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Published: 2026-01-29

Link: http://arxiv.org/pdf/2601.21821

1. 📘 Topic and Domain: The paper focuses on multimodal reasoning in vision-language models, specifically addressing the performance gap between open-source and proprietary systems through data-centric methods.
2. 💡 Previous Research and New Ideas: Building on existing multimodal datasets like FineVision and LLaVA-OneVision, the paper proposes MMFineReason, a large-scale dataset with high-quality Chain-of-Thought annotations distilled from Qwen3-VL-235B-A22B-Thinking, which addresses the data imbalance and inconsistent reasoning quality of existing corpora.
3. ❓ Problem: The paper aims to solve the lack of high-quality reasoning data in open-source multimodal models, particularly the scarcity of STEM diagram and visual puzzle samples with consistent, long-form reasoning annotations.
4. 🛠️ Methods: The authors use a three-stage pipeline: data collection and standardization from diverse sources, CoT rationale generation via teacher-model distillation, and difficulty-aware filtering for quality verification and efficient subset creation (a filtering sketch follows this list).
5. 📊 Results and Evaluation: MMFineReason models (2B/4B/8B) achieve state-of-the-art results for their size class, with the 4B model surpassing Qwen3-VL-8B-Thinking and the 8B model outperforming Qwen3-VL-30B-A3B-Thinking; training on only 7% of the data achieves performance comparable to the full dataset.
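
A minimal sketch of what difficulty-aware filtering could look like, assuming each sample carries a pass rate from repeated probe-model attempts; the thresholds and field names are illustrative, not from the paper.

    def difficulty_filter(samples, low=0.1, high=0.9):
        # Keep samples in a useful difficulty band: a pass rate near 1.0
        # means the sample is trivial, near 0.0 likely noisy or unsolvable.
        return [s for s in samples if low <= s["pass_rate"] <= high]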

Figure: MMFineReason data pipeline flow

• Stage 1 (data collection & processing): sources include FineVision, BMMR, Euclid30K, GameQA-140K, and others, followed by data cleaning & standardization.
• Stage 2 (reasoning distillation): Qwen3-VL-235B-A22B-Thinking acts as the teacher for long CoT generation, dense caption generation, and template validation.
• Stage 3 (data selection): quality filtering (length & template), correctness verification, difficulty-aware filtering, and n-gram de-duplication.
• Output datasets: MMFineReason-1.8M, MMFineReason-586K, and MMFineReason-123K, feeding SFT training and RL training (GSPO).
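
The n-gram de-duplication stage can be sketched with Jaccard similarity over word 3-gram sets, one standard approach; the paper's exact procedure is not specified in this summary.

    def ngrams(text: str, n: int = 3) -> set:
        # Set of word n-grams for one sample's text.
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
        # Flag a pair as near-duplicates when the Jaccard overlap of
        # their n-gram sets exceeds the threshold.
        ga, gb = ngrams(a), ngrams(b)
        if not ga or not gb:
            return False
        return len(ga & gb) / len(ga | gb) >= threshold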
Q1
1. What surprising finding did the authors discover about data efficiency in MMFineReason?
Training on just 7% of the data (123K samples) achieves performance comparable to the full 1.8M dataset
Larger models always require exponentially more data to achieve better performance
Natural images require 10x more samples than STEM diagrams for effective training
Q2
2. Which teacher model did MMFineReason use to distill high-quality reasoning annotations, and what made this choice significant?
GPT-4V, because it's the most widely used proprietary model for data annotation
Qwen3-VL-235B-A22B-Thinking, making the pipeline fully open-source without relying on closed APIs
A custom ensemble of multiple smaller models to reduce computational costs
Q3
3. What counterintuitive result did the ablation studies reveal about ultra-high resolution inputs for reasoning tasks?
2048×2048 resolution dramatically improves performance on all benchmarks by 20%
Resolution has no impact on reasoning tasks whatsoever
Ultra-high resolution (2048²) shows diminishing returns compared to 768² for reasoning, though it still helps with natural images