2026-03-17 Papers

Paper 1

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Published: 2026-03-16

Link: http://arxiv.org/pdf/2603.15594

1. 📘 Topic and Domain: The paper presents OpenSeeker, a fully open-source search agent for web-based information retrieval in the domain of Large Language Model agents.
2. 💡 Previous Research and New Ideas: Building on the ReAct paradigm and prior search agents, whose strongest instances are dominated by corporate labs, the paper proposes fact-grounded, scalable, and controllable QA synthesis together with denoised trajectory synthesis to democratize high-quality training data.
3. ❓ Problem: The paper addresses the lack of transparent, high-quality training data for search agents, which has been monopolized by industrial giants and hindered open-source community progress.
4. 🛠️ Methods: The authors use graph expansion and entity obfuscation to generate complex multi-hop QA pairs, and employ retrospective summarization during trajectory synthesis so the teacher sees denoised web content while the student still trains on the raw data.
5. 📊 Results and Evaluation: OpenSeeker achieves state-of-the-art performance among open-source agents across four benchmarks (BrowseComp: 29.5%, BrowseComp-ZH: 48.4%, xbench: 74.0%, WideSearch: 59.4%) using only 11.7k samples and simple SFT, even surpassing some industrial models trained with extensive resources.
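The synthesis recipe in point 4 — expand a fact chain over an entity graph, then obfuscate the entities so no single lookup answers the question — can be sketched roughly as follows. The toy graph, question templates, and function names are illustrative stand-ins, not the paper's actual pipeline, which works over a web corpus with LLM-based generation and verification:

```python
import random

# Toy knowledge graph: entity -> list of (relation, object) facts.
# The real pipeline extracts such facts from a web corpus.
GRAPH = {
    "Marie Curie": [("was born in", "Warsaw")],
    "Warsaw": [("is the capital of", "Poland")],
    "Poland": [("joined the EU in", "2004")],
}

def expand_chain(start, hops, rng):
    """Graph expansion: random-walk a multi-hop chain of facts."""
    chain, node = [], start
    for _ in range(hops):
        facts = GRAPH.get(node)
        if not facts:
            break
        rel, obj = rng.choice(facts)
        chain.append((node, rel, obj))
        node = obj
    return chain

def synthesize_qa(start, hops, seed=0):
    """Turn the chain into a QA pair; entity obfuscation replaces every
    intermediate entity with an indirect description, so answering
    requires resolving each hop in turn."""
    chain = expand_chain(start, hops, random.Random(seed))
    rest = chain[1:]
    parts = [f"which {rel} what?" if i == len(rest) - 1
             else f"which {rel} a place,"
             for i, (_, rel, _) in enumerate(rest)]
    question = (f"Consider the person who {chain[0][1]} a certain city; "
                "that city " + " ".join(parts))
    return question, chain[-1][2]

q, a = synthesize_qa("Marie Curie", 3)
```

A verifier stage would then apply the paper's difficulty and solvability checks before accepting each pair.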

[Figure: OpenSeeker training data synthesis pipeline. Fact-grounded, scalable, controllable QA synthesis: web corpus → graph expansion → entity extraction → question generation → entity obfuscation → QA verifier (difficulty and solvability checks) → final QA. Denoised trajectory synthesis: dynamic context denoising with summarized history plus raw recent context; raw tool responses are summarized during synthesis, while training/inference learns to predict expert decisions conditioned on the original raw trajectory.]
Q1. What unique approach does OpenSeeker use to generate complex questions that require multi-hop reasoning?
- It uses GPT-5 to directly generate difficult questions from scratch
- It reverse-engineers the web graph through topological expansion and entity obfuscation
- It crowdsources questions from human annotators on Mechanical Turk

Q2. How does OpenSeeker's training data volume compare to its performance achievements?
- It uses 147k samples like MiroThinker but achieves worse results
- It requires over 100k samples to match industrial baselines
- It achieves state-of-the-art results with only 11.7k synthesized samples

Q3. What is the key asymmetry in OpenSeeker's denoised trajectory synthesis method?
- The teacher model generates on summarized context while the student trains on raw, noisy trajectories
- The training uses Chinese data while inference uses English data
- The model is trained with reinforcement learning but deployed with supervised learning
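The asymmetry probed in Q3 — the teacher decides from a denoised, summarized context while the recorded trajectory keeps the raw tool responses for training — can be sketched as below; `teacher`, `tools`, and `summarize` are illustrative stand-ins for LLM and search-tool calls:

```python
def summarize(tool_response, max_len=200):
    """Stand-in for retrospective summarization of noisy web content;
    a real system would use an LLM to compress the response."""
    return tool_response[:max_len]

def synthesize_trajectory(teacher, question, tools, max_steps=8):
    raw_history = [question]    # what the student will be trained on
    clean_history = [question]  # the denoised view the teacher acts from
    for _ in range(max_steps):
        # The teacher picks its next action from the SUMMARIZED past...
        action = teacher(clean_history)
        if action["type"] == "answer":
            raw_history.append(action)
            break
        raw = tools[action["tool"]](action["args"])
        # ...but the RAW response is recorded in the training trajectory,
        raw_history += [action, raw]
        # while only a summary enters the teacher's future context.
        clean_history += [action, summarize(raw)]
    return raw_history  # SFT: predict teacher actions given raw context
```

The student therefore learns to make expert-quality decisions even when conditioned on the noisy context it will actually see at inference time.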

Paper 2

Grounding World Simulation Models in a Real-World Metropolis

Published: 2026-03-16

Link: http://arxiv.org/pdf/2603.15583

1. 📘 Topic and Domain: The paper presents a real-world grounded video world simulation model that generates city-scale videos anchored in actual urban environments, specifically Seoul.
2. 💡 Previous Research and New Ideas: Building on pretrained video world models and diffusion transformers, the paper introduces retrieval-augmented generation using street-view images to ground video generation in real locations rather than imagined environments.
3. ❓ Problem: The paper addresses the limitation that existing world models operate in entirely imagined environments, proposing to generate temporally consistent, spatially faithful videos grounded in actual physical locations.
4. 🛠️ Methods: The authors use cross-temporal pairing to handle temporal misalignment, synthetic urban datasets for trajectory diversity, view interpolation for sparse data, and a Virtual Lookahead Sink mechanism for long-horizon stability.
5. 📊 Results and Evaluation: SWM outperforms existing world models on benchmarks across Seoul, Busan, and Ann Arbor in visual quality, camera adherence, temporal coherence, and structural fidelity, maintaining stable generation over trajectories reaching hundreds of meters.
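The retrieval-augmented grounding in point 2 — fetching the street-view panoramas nearest the current camera position — reduces, at its core, to a nearest-neighbor lookup over geo-tagged images. A minimal sketch, assuming each database entry carries GPS coordinates; the flat scan below stands in for the paper's geo-indexed store of ~440K panoramas:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * asin(sqrt(a))

def retrieve_references(db, lat, lon, k=3, max_dist_m=50.0):
    """Nearest-neighbor retrieval of street-view references for one
    camera position; a real system would query a spatial index and
    apply depth-based filtering on the candidates."""
    scored = sorted(db, key=lambda p: haversine_m(lat, lon, p["lat"], p["lon"]))
    return [p for p in scored[:k]
            if haversine_m(lat, lon, p["lat"], p["lon"]) <= max_dist_m]
```

During generation this lookup is repeated along the camera trajectory, so each chunk of video is conditioned on references for where the camera currently is.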

[Figure: Seoul World Model method overview. Data construction: real street-view data (440K Seoul panoramas with GPS coordinates, cross-temporal pairing, view interpolation, DA3 depth estimation); synthetic urban data (12.7K CARLA videos with pedestrian, vehicle, and free-camera trajectories plus street-view references); driving videos (Waymo dataset, real footage, scenario diversity). Architecture and generation: user inputs (starting location, camera motion, text prompt) drive street-view retrieval from a geo-indexed database via nearest-neighbor search with depth-based filtering; a Virtual Lookahead Sink provides dynamic anchoring on retrieved future frames to fix error accumulation; geometric referencing (depth-based reprojection, warped video) and semantic referencing (original reference injection, multi-reference attention) condition an autoregressive diffusion transformer (fine-tuned Cosmos-Predict2.5-2B, teacher/self-forcing variants, chunk-based generation) that outputs city-scale, long-trajectory, real-world-grounded videos.]
Q1. What is the key innovation of the Virtual Lookahead Sink mechanism in SWM?
- It uses the first frame of the video as a permanent anchor throughout generation
- It dynamically retrieves nearby street-view images as future destinations to prevent error accumulation
- It compresses the entire video history into a single latent representation

Q2. Why does SWM employ cross-temporal pairing during training?
- To ensure references and targets have identical dynamic objects like vehicles and pedestrians
- To increase the temporal resolution of street-view captures from 5-20m to 1m intervals
- To teach the model to distinguish persistent structures from transient content by using references from different timestamps

Q3. How does SWM handle the challenge of sparse street-view data captured at irregular intervals?
- By using an intermittent freeze-frame strategy that repeats keyframes to match the 3D VAE's temporal compression
- By training exclusively on synthetic CARLA data with continuous video sequences
- By limiting generation to only 12 frames at a time to avoid temporal gaps
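Cross-temporal pairing (Q2) pairs captures of the same location from different timestamps, so persistent structure (buildings, roads) agrees between reference and target while transient content (cars, pedestrians) does not. A minimal sketch, with assumed record fields `loc_id`, `day`, and `img`:

```python
from collections import defaultdict
from itertools import permutations

def cross_temporal_pairs(panoramas, min_gap_days=30):
    """Pair captures of the same location taken at different times:
    the reference shows the place at one timestamp, the target at
    another, so the model learns to carry over only what persists."""
    by_loc = defaultdict(list)
    for p in panoramas:
        by_loc[p["loc_id"]].append(p)
    pairs = []
    for caps in by_loc.values():
        for ref, tgt in permutations(caps, 2):
            if abs(ref["day"] - tgt["day"]) >= min_gap_days:
                pairs.append((ref["img"], tgt["img"]))
    return pairs
```

The `min_gap_days` threshold is a hypothetical knob; the point is only that reference and target must come from distinct capture passes.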

Paper 3

AI Can Learn Scientific Taste

Published: 2026-03-15

Link: http://arxiv.org/pdf/2603.14473

1. 📘 Topic and Domain: The paper focuses on training AI models to develop scientific taste - the ability to judge and propose research ideas with high potential impact - in the domain of AI for scientific research.
2. 💡 Previous Research and New Ideas: Building on reinforcement learning paradigms like RLHF and RLVR, the paper proposes Reinforcement Learning from Community Feedback (RLCF), a novel approach that uses large-scale citation data as community signals to train models for scientific judgment and ideation.
3. ❓ Problem: The paper addresses the gap between current AI scientists' technical capabilities and human scientists' ability to identify high-impact research directions, aiming to enhance AI's capacity for scientific taste beyond just executing research tasks.
4. 🛠️ Methods: The authors use GRPO to train Scientific Judge on 700K citation-based paper pairs for preference modeling, then employ Comparison-Based GRPO with Scientific Judge as a reward model to train Scientific Thinker for generating high-impact research ideas.
5. 📊 Results and Evaluation: Scientific Judge achieves 80.6% accuracy, outperforming GPT-4o and Gemini 3 Pro, and generalizes across time, fields, and peer-review metrics; Scientific Thinker achieves 81.5% win rate against baseline models in proposing higher-impact research ideas.
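The pair construction behind point 4 — 700K field- and time-matched paper pairs labeled by citations — can be sketched roughly as below. The dict keys and the citation-margin heuristic are assumptions for illustration, not the paper's exact filtering rule:

```python
from collections import defaultdict

def build_preference_pairs(papers, min_ratio=2.0):
    """Group papers by (field, year) so comparisons are fair across
    subfields and publication dates, then emit (winner, loser) pairs
    whenever one paper clearly out-cites the other."""
    buckets = defaultdict(list)
    for p in papers:
        buckets[(p["field"], p["year"])].append(p)
    pairs = []
    for group in buckets.values():
        group = sorted(group, key=lambda p: p["citations"], reverse=True)
        for i, hi in enumerate(group):
            for lo in group[i + 1:]:
                # Require a clear citation margin to reduce label noise.
                if hi["citations"] >= min_ratio * max(lo["citations"], 1):
                    pairs.append((hi["id"], lo["id"]))
    return pairs
```

Each pair then becomes a training example for Scientific Judge: given two matched papers, predict which one the community rewarded more.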

[Figure: method flow. Stage 1, community feedback collection: citations form SciJudgeBench, 700K field- and time-matched paper pairs. Stage 2, preference modeling: Scientific Judge is trained with GRPO to predict the higher-impact paper and generalizes across time and fields. Stage 3, preference alignment: Scientific Thinker is trained with Comparison-Based GRPO using Scientific Judge as a generative reward model to propose high-impact ideas. Key components: RLCF framework, pairwise comparison, GRPO algorithm, generative reward model. Key results: outperforms GPT-5.2 and Gemini 3 Pro; 81.5% win rate for idea generation.]
Q1. What philosophical foundation does the paper draw upon to justify that scientific taste can be learned from community feedback?
- Plato's theory of ideal forms, suggesting that perfect scientific judgment exists independently of human perception
- Hume and Kant's theories that taste emerges from qualified community judgment rather than individual preference
- Aristotle's empiricism, arguing that scientific taste is purely derived from observational data

Q2. How does the paper's Comparison-Based GRPO method calculate rewards for Scientific Thinker during training?
- By directly scoring each generated idea using a numerical impact prediction model
- By conducting round-robin tournaments where each idea's win rate against other sampled ideas becomes its reward
- By comparing each idea only to a fixed baseline policy's output

Q3. What surprising generalization capability did Scientific Judge demonstrate beyond citation prediction?
- It could predict stock market trends based on company research papers
- It could generate novel mathematical proofs when trained only on abstracts
- It could accurately predict peer review scores despite being trained only on citation data
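The round-robin reward scheme described in Q2 can be sketched in a few lines; `judge` here is an illustrative stand-in for Scientific Judge, returning whether the first idea beats the second:

```python
def group_rewards(ideas, judge):
    """Comparison-based rewards for one GRPO group: idea i's reward is
    its win rate against every other idea sampled in the same group."""
    n = len(ideas)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and judge(ideas[i], ideas[j]):  # True if i beats j
                wins[i] += 1
    return [w / (n - 1) for w in wins]
```

With group size n, each idea's reward is its wins over n - 1 opponents, giving a dense comparative signal without requiring any absolute impact score.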