2025-07-04 Papers


Paper 1

LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Published: 2025-07-03

Link: http://arxiv.org/pdf/2507.02813

1. 📘 Topic and Domain: 3D scene reconstruction and understanding from sparse image views using language-embedded representations and video diffusion models.
2. 💡 Previous Research and New Ideas: Building on prior work in NeRF and 3D Gaussian Splatting for 3D reconstruction, the paper proposes a novel integration of video diffusion models and language-feature compression for generalizable 3D scene understanding.
3. ❓ Problem: Existing methods require dense calibrated views and per-scene optimization, limiting generalization and applicability when only sparse views are available.
4. 🛠️ Methods: Introduces TriMap video diffusion to generate 3D-consistent RGB, normal, and semantic maps; develops a Language Quantized Compressor (LQC) for efficient feature encoding; combines these to reconstruct language-embedded 3D surface fields.
5. 📊 Results and Evaluation: Achieves superior performance on LERF-OVS and ScanNet datasets, with 10.58% improvement in mIoU and 31.18% in mAcc compared to state-of-the-art methods.
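The LQC's core idea, compressing high-dimensional language features into discrete codebook indices, can be sketched as a nearest-neighbor vector quantizer. This is a toy illustration with mock data, not the paper's implementation: the codebook and encoder here are random stand-ins for learned weights, though the sizes (K = 2048 embeddings, D = 3) follow the paper's reported settings.

```python
import numpy as np

# Minimal sketch of the Language Quantized Compressor (LQC) idea:
# compress high-dimensional (e.g. CLIP) language features into discrete
# codebook indices. Codebook and encoder are random stand-ins here.

rng = np.random.default_rng(0)
K, D = 2048, 3                                  # codebook size, code dimension
codebook = rng.normal(size=(K, D))              # stand-in for learned embeddings
W = rng.normal(size=(512, D)) / np.sqrt(512)    # stand-in encoder: 512-d -> D

def quantize(features):
    """Project features to D dims, then return nearest-codebook indices."""
    z = features @ W                                        # (N, D)
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(d, axis=-1)                            # (N,) discrete codes

feats = rng.normal(size=(5, 512))   # five mock CLIP-like feature vectors
idx = quantize(feats)
print(idx.shape)                    # (5,) -- one integer code per feature
```

Storing one small integer per feature instead of a 512-d float vector is what makes the representation cheap to embed in a 3D field.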

Workflow overview (recovered from the paper's pipeline figure):

- Input: sparse views, as few as 2 images.
- TriMap video diffusion (CogVideoX DiT architecture): generates dense, 3D-consistent RGB, normal, and semantic frames plus multi-hierarchy segmentation masks. Progressive multi-task training: web-data key-frame interpolation → 3D-consistent data (10K clips) → normal annotation (200 clips) → semantic annotation (300 clips).
- Language Quantized Compressor (LQC): vector quantization of CLIP features into discrete indices (K = 2048 embeddings, D = 3), trained for 500K steps on COCO with the loss λ₁L_recon + λ₂L_emb + λ₃L_mask, using stop-gradient vector quantization, dictionary learning for the embeddings, and text-guided activation alignment.
- Language-embedded surface fields: optimized with RGB, normal, and semantic losses (5K steps with RGB + normal loss, then 5K more adding the semantic losses), progressive normal regularization, 2D/3D clustering with masks, surface alignment, Gaussian optimization, and DUSt3R initialization.
- Output: a 3D language field supporting novel view synthesis, semantic understanding, and open-ended queries (e.g. "paper towel roll", "red mug", "stuffed bear", "paper bag") rendered as relevancy maps.
- Key innovations: a unified generative paradigm, multi-modal consistency, generalizable compression, sparse-view capability, and no per-scene optimization.
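Rendering a relevancy map for an open-ended query like "red mug" amounts to scoring per-pixel language features against a query embedding. A minimal cosine-similarity sketch with mock data (shapes and the feature dimension are assumptions, not the paper's values):

```python
import numpy as np

# Sketch of open-vocabulary querying: cosine similarity between a query
# text embedding and per-pixel language features decoded from the field.
# All tensors here are random mock data.

def relevancy_map(pixel_feats, query_emb):
    """pixel_feats: (H, W, D) language features; query_emb: (D,)."""
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return p @ q   # (H, W) cosine similarities in [-1, 1]

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))   # mock 4x4 feature image, D = 8
query = rng.normal(size=8)           # mock query embedding
rel = relevancy_map(feats, query)
print(rel.shape)                     # (4, 4)
```

Thresholding or softmax-normalizing such a map gives the rendered relevancy overlays shown in the figure.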
Q1. What is the main innovation of LangScene-X compared to previous 3D scene reconstruction methods?
- It requires more camera views than previous methods
- It can work with as few as two input images using video diffusion
- It only works on indoor scenes

Q2. What is the purpose of the Language Quantized Compressor (LQC) in the system?
- To increase the dimensionality of language features
- To generate new semantic labels
- To compress high-dimensional language features into efficient discrete representations

Q3. What types of maps does the TriMap video diffusion model generate?
- Only RGB and depth maps
- RGB, normal maps, and semantic maps
- Only semantic segmentation maps

Paper 2

WebSailor: Navigating Super-human Reasoning for Web Agent

Published: 2025-07-03

Link: http://arxiv.org/pdf/2507.02592

1. 📘 Topic and Domain: The paper focuses on developing WebSailor, a post-training methodology for web agents to achieve superhuman reasoning capabilities in complex information-seeking tasks.
2. 💡 Previous Research and New Ideas: Motivated by proprietary systems like DeepResearch that have demonstrated superhuman capabilities, the paper proposes a novel approach for instilling similar advanced reasoning patterns in open-source models through uncertainty-reduction techniques.
3. ❓ Problem: The paper addresses the performance gap between open-source and proprietary web agents in complex information-seeking tasks, particularly their inability to systematically reduce uncertainty when navigating vast information landscapes.
4. 🛠️ Methods: The authors use a combination of structured sampling and information obfuscation to generate complex training data (SailorFog-QA), implement rejection sampling fine-tuning (RFT) cold start, and develop an efficient agentic RL algorithm called Duplicating Sampling Policy Optimization (DUPO).
5. 📊 Results and Evaluation: WebSailor significantly outperformed all open-source agents and matched proprietary agents' performance on BrowseComp-en/zh benchmarks, while also showing strong performance on simpler tasks like GAIA and XBench-DeepSearch.
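The rejection-sampling (RFT) cold start can be pictured as a simple filter over candidate trajectories. The field names and example records below are hypothetical, but the thresholds follow the settings reported in the paper (keep only correct trajectories under 32k tokens with more than 5 tool calls):

```python
# Sketch of the RFT cold-start rejection filter (hypothetical record format):
# keep only correct trajectories that stay under the token budget and show
# non-trivial tool use.

def keep(traj):
    return (traj["correct"]
            and traj["num_tokens"] < 32_000
            and traj["tool_calls"] > 5)

trajectories = [
    {"correct": True,  "num_tokens": 18_000, "tool_calls": 9},   # kept
    {"correct": True,  "num_tokens": 40_000, "tool_calls": 12},  # too long
    {"correct": False, "num_tokens": 10_000, "tool_calls": 7},   # wrong answer
    {"correct": True,  "num_tokens": 12_000, "tool_calls": 3},   # trivial tool use
]
filtered = [t for t in trajectories if keep(t)]
print(len(filtered))   # 1
```

Filtering this aggressively is what distills a small (2k+) but high-quality set of expert traces for supervised fine-tuning before RL.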

Training pipeline (recovered from the paper's pipeline figure):

- Data synthesis (SailorFog-QA): graph sampling plus information obfuscation, yielding Level 3 complexity tasks.
- Trajectory generation: expert LRM solutions (QwQ, DeepSeek-R1) provide action-observation traces, from which the reasoning is reconstructed.
- RFT cold start: rejection sampling that keeps correct trajectories only, filtered to <32k tokens and >5 tool calls, yielding 2k+ examples.
- DUPO RL training: policy optimization with dynamic sampling, a duplicating strategy, and group-relative advantage, for a 2-3x speedup.
- Output: WebSailor models (3B, 7B, 32B, 72B) with superior reasoning capability, evaluated at BrowseComp-en 12.0%, BrowseComp-zh 30.1%, GAIA 55.4%, and XBench 55.0% (72B).
- Task complexity levels: Level 1, low uncertainty (single search); Level 2, multi-hop QA (structured path); Level 3, high uncertainty (complex reasoning).
- Training specifications: ReAct framework with search/visit tools, up to 30 tool calls per trajectory, rule-based reward (format + answer).
- Key innovations: Level 3 uncertainty tasks, graph-based QA synthesis, reasoning reconstruction, and the DUPO algorithm.
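Two DUPO ingredients named above, group-relative advantages and the duplicating strategy, can be sketched in a few lines. This is a toy illustration under assumed details, not the paper's implementation: rewards are mock scalars, and "duplicating" is reduced to dropping zero-variance rollout groups in favor of informative ones.

```python
import statistics

# Toy sketch of group-relative advantage estimation: each query is rolled
# out several times, and every trajectory's reward is normalized against
# its own group's statistics.

def group_relative_advantages(rewards):
    """Normalize each trajectory's reward against its own rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                      # all rollouts agree: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Rollout groups per query; q2's rewards are identical, so its advantages
# would all be zero and the group contributes nothing to the update.
groups = {"q1": [1.0, 0.0, 1.0], "q2": [0.0, 0.0, 0.0]}
informative = {q: r for q, r in groups.items() if statistics.pstdev(r) > 0}
print(sorted(informative))           # ['q1']
advs = group_relative_advantages(groups["q1"])
print(round(sum(advs), 6))           # 0.0 -- advantages are zero-mean by construction
```

Replacing uninformative groups with duplicates of informative ones keeps every slot in the RL batch contributing gradient signal, which is where the reported speedup comes from.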
Q1. What is the key innovation in WebSailor's training data generation that helps achieve superhuman reasoning?
- Using pre-existing web browsing datasets
- Generating data with deliberately high and hard-to-reduce uncertainty
- Copying training patterns from proprietary systems

Q2. What unique challenge did the authors face when using powerful open-source LRMs to generate training trajectories?
- The LRMs were too slow to generate enough training data
- The LRMs could not solve complex reasoning tasks
- The LRMs' verbose reasoning style could restrict the agent's ability to develop flexible strategies

Q3. How did WebSailor-7B's performance demonstrate the effectiveness of the proposed methodology?
- It achieved better results than much larger 32B models despite its smaller size
- It performed well only on simple tasks
- It matched the performance of GPT-4

Paper 3

Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Published: 2025-07-03

Link: http://arxiv.org/pdf/2507.02321

1. 📘 Topic and Domain: The paper focuses on improving spatial control in text-to-image diffusion models, specifically enhancing ControlNet's ability to maintain consistency between input controls and generated images.
2. 💡 Previous Research and New Ideas: Building on ControlNet and ControlNet++, which enforce control alignment only in late diffusion steps, this paper proposes InnerControl, a novel approach that enforces spatial consistency across all diffusion steps using intermediate features.
3. ❓ Problem: The paper addresses the limitation of existing methods that only enforce control alignment in late diffusion steps while neglecting early stages where spatial structure predominantly emerges.
4. 🛠️ Methods: The authors use lightweight convolutional networks to extract control signals from intermediate UNet features at every diffusion step, enabling explicit alignment throughout the entire generation process.
5. 📊 Results and Evaluation: The method achieved better control alignment across different tasks (depth, edges, LineArt), with 7.87% RMSE reduction compared to ControlNet++ and 10.22% compared to CtrlU for depth estimation, while maintaining competitive image quality measured by FID scores.

Training strategy (recovered from the paper's pipeline figure):

- Pipeline: an input image and control signal (depth, edges) are noised to x_t at timestep t and passed through the UNet encoder with the ControlNet branch (control block + zero convolution); intermediate UNet decoder features are aggregated by a lightweight CNN H(·, t) that predicts the control signal ĉ_spatial.
- Losses: the standard diffusion loss L_diffusion (noise prediction error); a reward loss L_reward applied on late steps only (t ∈ [0, 200]); and the proposed alignment loss L_alignment applied across all steps (t ∈ [0, 920]).
- Combined training objective: L_training = L_diffusion + α·L_reward + β·L_alignment, enforcing consistency across the entire diffusion trajectory.
- Key innovation: control signals are extracted from intermediate UNet features at all denoising steps, not just the final ones.
- Reported benefits: better control alignment (RMSE down 7.87% vs. ControlNet++), image quality maintained (FID) alongside better control, alignment that works on high-noise latents, stability across low and high guidance scales, and compatibility with existing methods such as ControlNet++.
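The timestep-gated combined objective can be written out as a small function. This is a sketch, not the authors' code: the loss terms and weights are mock scalars, while the gating ranges (reward on t ∈ [0, 200], alignment on t ∈ [0, 920]) follow the figure.

```python
# Sketch of InnerControl's combined objective,
#   L = L_diffusion + alpha * L_reward + beta * L_alignment,
# where the reward loss fires only at late (low-noise) steps and the
# alignment loss across nearly the whole trajectory. All values are mock.

def combined_loss(t, l_diffusion, l_reward, l_alignment, alpha=1.0, beta=1.0):
    loss = l_diffusion
    if t <= 200:                 # reward feedback: late steps only
        loss += alpha * l_reward
    if t <= 920:                 # alignment feedback: almost all steps
        loss += beta * l_alignment
    return loss

# At a high-noise step the reward term is inactive, but alignment still fires:
print(combined_loss(t=800, l_diffusion=1.0, l_reward=0.5, l_alignment=0.2))  # 1.2
# At a late step all three terms contribute:
print(combined_loss(t=100, l_diffusion=1.0, l_reward=0.5, l_alignment=0.2))  # 1.7
```

The gating is the whole point: prior methods effectively set β = 0, so early steps, where spatial structure emerges, received no control feedback at all.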
Q1. What is the main limitation of ControlNet++ that InnerControl aims to address?
- Poor image quality in generated outputs
- Focus only on late-stage alignment while neglecting early diffusion steps
- High computational cost during training

Q2. How does InnerControl extract control signals during the diffusion process?
- Using large pretrained vision models
- Through single-step image prediction
- Using lightweight convolutional networks on intermediate UNet features

Q3. What was the improvement in RMSE for depth estimation compared to ControlNet++ at guidance scale 7.5?
- 7.87%
- 15.22%
- 3.94%