2025-07-04 Papers


Paper 1

LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Published: 2025-07-03

Link: http://arxiv.org/pdf/2507.02813

1. 📘 Topic and Domain: 3D scene reconstruction and understanding from sparse image views using language-embedded representations and video diffusion models.
2. 💡 Previous Research and New Ideas: Building on prior work in NeRF and 3D Gaussian Splatting for 3D reconstruction, the paper proposes a novel integration of video diffusion models and language-feature compression for generalizable 3D scene understanding.
3. ❓ Problem: Existing methods require dense calibrated views and per-scene optimization, limiting generalization and applicability when only sparse views are available.
4. 🛠️ Methods: Introduces TriMap video diffusion to generate 3D-consistent RGB, normal, and semantic maps; develops a Language Quantized Compressor (LQC) for efficient feature encoding; combines these to reconstruct language-embedded 3D surface fields.
5. 📊 Results and Evaluation: Achieves superior performance on LERF-OVS and ScanNet datasets, with 10.58% improvement in mIoU and 31.18% in mAcc compared to state-of-the-art methods.
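The LQC's core idea, compressing high-dimensional language features into discrete codebook indices, can be sketched as a nearest-neighbor vector quantizer. This is a toy illustration with mock data, not the paper's implementation: the codebook and encoder here are random stand-ins for learned weights, though the sizes (K = 2048 embeddings, D = 3) follow the paper's reported settings.

```python
import numpy as np

# Minimal sketch of the Language Quantized Compressor (LQC) idea:
# compress high-dimensional (e.g. CLIP) language features into discrete
# codebook indices. Codebook and encoder are random stand-ins here.

rng = np.random.default_rng(0)
K, D = 2048, 3                                  # codebook size, code dimension
codebook = rng.normal(size=(K, D))              # stand-in for learned embeddings
W = rng.normal(size=(512, D)) / np.sqrt(512)    # stand-in encoder: 512-d -> D

def quantize(features):
    """Project features to D dims, then return nearest-codebook indices."""
    z = features @ W                                        # (N, D)
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(d, axis=-1)                            # (N,) discrete codes

feats = rng.normal(size=(5, 512))   # five mock CLIP-like feature vectors
idx = quantize(feats)
print(idx.shape)                    # (5,) -- one integer code per feature
```

Storing one small integer per feature instead of a 512-d float vector is what makes the representation cheap to embed in a 3D field.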

Workflow overview (recovered from the paper's pipeline figure):

- Input: sparse views, as few as 2 images.
- TriMap video diffusion (CogVideoX DiT architecture): generates dense, 3D-consistent RGB, normal, and semantic frames plus multi-hierarchy segmentation masks. Progressive multi-task training: web-data key-frame interpolation → 3D-consistent data (10K clips) → normal annotation (200 clips) → semantic annotation (300 clips).
- Language Quantized Compressor (LQC): vector quantization of CLIP features into discrete indices (K = 2048 embeddings, D = 3), trained for 500K steps on COCO with the loss λ₁L_recon + λ₂L_emb + λ₃L_mask, using stop-gradient vector quantization, dictionary learning for the embeddings, and text-guided activation alignment.
- Language-embedded surface fields: optimized with RGB, normal, and semantic losses (5K steps with RGB + normal loss, then 5K more adding the semantic losses), progressive normal regularization, 2D/3D clustering with masks, surface alignment, Gaussian optimization, and DUSt3R initialization.
- Output: a 3D language field supporting novel view synthesis, semantic understanding, and open-ended queries (e.g. "paper towel roll", "red mug", "stuffed bear", "paper bag") rendered as relevancy maps.
- Key innovations: a unified generative paradigm, multi-modal consistency, generalizable compression, sparse-view capability, and no per-scene optimization.
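Rendering a relevancy map for an open-ended query like "red mug" amounts to scoring per-pixel language features against a query embedding. A minimal cosine-similarity sketch with mock data (shapes and the feature dimension are assumptions, not the paper's values):

```python
import numpy as np

# Sketch of open-vocabulary querying: cosine similarity between a query
# text embedding and per-pixel language features decoded from the field.
# All tensors here are random mock data.

def relevancy_map(pixel_feats, query_emb):
    """pixel_feats: (H, W, D) language features; query_emb: (D,)."""
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return p @ q   # (H, W) cosine similarities in [-1, 1]

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))   # mock 4x4 feature image, D = 8
query = rng.normal(size=8)           # mock query embedding
rel = relevancy_map(feats, query)
print(rel.shape)                     # (4, 4)
```

Thresholding or softmax-normalizing such a map gives the rendered relevancy overlays shown in the figure.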
Q1. What is the main innovation of LangScene-X compared to previous 3D scene reconstruction methods?
- It requires more camera views than previous methods
- It can work with as few as two input images using video diffusion
- It only works on indoor scenes

Q2. What is the purpose of the Language Quantized Compressor (LQC) in the system?
- To increase the dimensionality of language features
- To generate new semantic labels
- To compress high-dimensional language features into efficient discrete representations

Q3. What types of maps does the TriMap video diffusion model generate?
- Only RGB and depth maps
- RGB, normal maps, and semantic maps
- Only semantic segmentation maps

Paper 2

WebSailor: Navigating Super-human Reasoning for Web Agent

Published: 2025-07-03

Link: http://arxiv.org/pdf/2507.02592

1. 📘 Topic and Domain: The paper focuses on developing WebSailor, a post-training methodology for web agents to achieve superhuman reasoning capabilities in complex information-seeking tasks.
2. 💡 Previous Research and New Ideas: Motivated by proprietary systems like DeepResearch that have demonstrated superhuman capabilities, the paper proposes a novel approach for instilling similar advanced reasoning patterns in open-source models through uncertainty-reduction techniques.
3. ❓ Problem: The paper addresses the performance gap between open-source and proprietary web agents in complex information-seeking tasks, particularly their inability to systematically reduce uncertainty when navigating vast information landscapes.
4. 🛠️ Methods: The authors use a combination of structured sampling and information obfuscation to generate complex training data (SailorFog-QA), implement rejection sampling fine-tuning (RFT) cold start, and develop an efficient agentic RL algorithm called Duplicating Sampling Policy Optimization (DUPO).
5. 📊 Results and Evaluation: WebSailor significantly outperformed all open-source agents and matched proprietary agents' performance on BrowseComp-en/zh benchmarks, while also showing strong performance on simpler tasks like GAIA and XBench-DeepSearch.
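The rejection-sampling (RFT) cold start can be pictured as a simple filter over candidate trajectories. The field names and example records below are hypothetical, but the thresholds follow the settings reported in the paper (keep only correct trajectories under 32k tokens with more than 5 tool calls):

```python
# Sketch of the RFT cold-start rejection filter (hypothetical record format):
# keep only correct trajectories that stay under the token budget and show
# non-trivial tool use.

def keep(traj):
    return (traj["correct"]
            and traj["num_tokens"] < 32_000
            and traj["tool_calls"] > 5)

trajectories = [
    {"correct": True,  "num_tokens": 18_000, "tool_calls": 9},   # kept
    {"correct": True,  "num_tokens": 40_000, "tool_calls": 12},  # too long
    {"correct": False, "num_tokens": 10_000, "tool_calls": 7},   # wrong answer
    {"correct": True,  "num_tokens": 12_000, "tool_calls": 3},   # trivial tool use
]
filtered = [t for t in trajectories if keep(t)]
print(len(filtered))   # 1
```

Filtering this aggressively is what distills a small (2k+) but high-quality set of expert traces for supervised fine-tuning before RL.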

Training pipeline (recovered from the paper's pipeline figure):

- Data synthesis (SailorFog-QA): graph sampling plus information obfuscation, yielding Level 3 complexity tasks.
- Trajectory generation: expert LRM solutions (QwQ, DeepSeek-R1) provide action-observation traces, from which the reasoning is reconstructed.
- RFT cold start: rejection sampling that keeps correct trajectories only, filtered to <32k tokens and >5 tool calls, yielding 2k+ examples.
- DUPO RL training: policy optimization with dynamic sampling, a duplicating strategy, and group-relative advantage, for a 2-3x speedup.
- Output: WebSailor models (3B, 7B, 32B, 72B) with superior reasoning capability, evaluated at BrowseComp-en 12.0%, BrowseComp-zh 30.1%, GAIA 55.4%, and XBench 55.0% (72B).
- Task complexity levels: Level 1, low uncertainty (single search); Level 2, multi-hop QA (structured path); Level 3, high uncertainty (complex reasoning).
- Training specifications: ReAct framework with search/visit tools, up to 30 tool calls per trajectory, rule-based reward (format + answer).
- Key innovations: Level 3 uncertainty tasks, graph-based QA synthesis, reasoning reconstruction, and the DUPO algorithm.
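Two DUPO ingredients named above, group-relative advantages and the duplicating strategy, can be sketched in a few lines. This is a toy illustration under assumed details, not the paper's implementation: rewards are mock scalars, and "duplicating" is reduced to dropping zero-variance rollout groups in favor of informative ones.

```python
import statistics

# Toy sketch of group-relative advantage estimation: each query is rolled
# out several times, and every trajectory's reward is normalized against
# its own group's statistics.

def group_relative_advantages(rewards):
    """Normalize each trajectory's reward against its own rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                      # all rollouts agree: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Rollout groups per query; q2's rewards are identical, so its advantages
# would all be zero and the group contributes nothing to the update.
groups = {"q1": [1.0, 0.0, 1.0], "q2": [0.0, 0.0, 0.0]}
informative = {q: r for q, r in groups.items() if statistics.pstdev(r) > 0}
print(sorted(informative))           # ['q1']
advs = group_relative_advantages(groups["q1"])
print(round(sum(advs), 6))           # 0.0 -- advantages are zero-mean by construction
```

Replacing uninformative groups with duplicates of informative ones keeps every slot in the RL batch contributing gradient signal, which is where the reported speedup comes from.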
Q1. What is the key innovation in WebSailor's training data generation that helps achieve superhuman reasoning?
- Using pre-existing web browsing datasets
- Generating data with deliberately high and hard-to-reduce uncertainty
- Copying training patterns from proprietary systems

Q2. What unique challenge did the authors face when using powerful open-source LRMs to generate training trajectories?
- The LRMs were too slow to generate enough training data
- The LRMs could not solve complex reasoning tasks
- The LRMs' verbose reasoning style could restrict the agent's ability to develop flexible strategies

Q3. How did WebSailor-7B's performance demonstrate the effectiveness of the proposed methodology?
- It achieved better results than much larger 32B models despite its smaller size
- It performed well only on simple tasks
- It matched the performance of GPT-4

Paper 3

Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Published: 2025-07-03

Link: http://arxiv.org/pdf/2507.02321

1. 📘 Topic and Domain: The paper focuses on improving spatial control in text-to-image diffusion models, specifically enhancing ControlNet's ability to maintain consistency between input controls and generated images.
2. 💡 Previous Research and New Ideas: Building on ControlNet and ControlNet++, which enforce control alignment only in late diffusion steps, this paper proposes InnerControl, a novel approach that enforces spatial consistency across all diffusion steps using intermediate features.
3. ❓ Problem: The paper addresses the limitation of existing methods that only enforce control alignment in late diffusion steps while neglecting early stages where spatial structure predominantly emerges.
4. 🛠️ Methods: The authors use lightweight convolutional networks to extract control signals from intermediate UNet features at every diffusion step, enabling explicit alignment throughout the entire generation process.
5. 📊 Results and Evaluation: The method achieved better control alignment across different tasks (depth, edges, LineArt), with 7.87% RMSE reduction compared to ControlNet++ and 10.22% compared to CtrlU for depth estimation, while maintaining competitive image quality measured by FID scores.

Training strategy (recovered from the paper's pipeline figure):

- Pipeline: an input image and control signal (depth, edges) are noised to x_t at timestep t and passed through the UNet encoder with the ControlNet branch (control block + zero convolution); intermediate UNet decoder features are aggregated by a lightweight CNN H(·, t) that predicts the control signal ĉ_spatial.
- Losses: the standard diffusion loss L_diffusion (noise prediction error); a reward loss L_reward applied on late steps only (t ∈ [0, 200]); and the proposed alignment loss L_alignment applied across all steps (t ∈ [0, 920]).
- Combined training objective: L_training = L_diffusion + α·L_reward + β·L_alignment, enforcing consistency across the entire diffusion trajectory.
- Key innovation: control signals are extracted from intermediate UNet features at all denoising steps, not just the final ones.
- Reported benefits: better control alignment (RMSE down 7.87% vs. ControlNet++), image quality maintained (FID) alongside better control, alignment that works on high-noise latents, stability across low and high guidance scales, and compatibility with existing methods such as ControlNet++.
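The timestep-gated combined objective can be written out as a small function. This is a sketch, not the authors' code: the loss terms and weights are mock scalars, while the gating ranges (reward on t ∈ [0, 200], alignment on t ∈ [0, 920]) follow the figure.

```python
# Sketch of InnerControl's combined objective,
#   L = L_diffusion + alpha * L_reward + beta * L_alignment,
# where the reward loss fires only at late (low-noise) steps and the
# alignment loss across nearly the whole trajectory. All values are mock.

def combined_loss(t, l_diffusion, l_reward, l_alignment, alpha=1.0, beta=1.0):
    loss = l_diffusion
    if t <= 200:                 # reward feedback: late steps only
        loss += alpha * l_reward
    if t <= 920:                 # alignment feedback: almost all steps
        loss += beta * l_alignment
    return loss

# At a high-noise step the reward term is inactive, but alignment still fires:
print(combined_loss(t=800, l_diffusion=1.0, l_reward=0.5, l_alignment=0.2))  # 1.2
# At a late step all three terms contribute:
print(combined_loss(t=100, l_diffusion=1.0, l_reward=0.5, l_alignment=0.2))  # 1.7
```

The gating is the whole point: prior methods effectively set β = 0, so early steps, where spatial structure emerges, received no control feedback at all.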
Q1. What is the main limitation of ControlNet++ that InnerControl aims to address?
- Poor image quality in generated outputs
- Focus only on late-stage alignment while neglecting early diffusion steps
- High computational cost during training

Q2. How does InnerControl extract control signals during the diffusion process?
- Using large pretrained vision models
- Through single-step image prediction
- Using lightweight convolutional networks on intermediate UNet features

Q3. What was the improvement in RMSE for depth estimation compared to ControlNet++ at guidance scale 7.5?
- 7.87%
- 15.22%
- 3.94%