2025-10-15 Papers


Paper 1

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

Published: 2025-10-14

Link: http://arxiv.org/pdf/2510.12276

1. 📘 Topic and Domain: The paper focuses on improving spatial awareness in Vision-Language-Action (VLA) models for robotic manipulation through implicit spatial representation alignment.
2. 💡 Previous Research and New Ideas: Whereas previous VLA models rely on explicit 3D sensor inputs or depth estimators, this paper proposes letting the model develop spatial comprehension implicitly, without relying on explicit 3D data.
3. ❓ Problem: The paper aims to solve the challenge of enabling VLA models to develop accurate spatial awareness without depending on explicit 3D sensor information or depth estimators, which are often unreliable or unavailable.
4. 🛠️ Methods: The authors introduce Spatial Forcing (SF), which aligns intermediate visual embeddings of VLAs with geometric representations from pretrained 3D foundation models through cosine similarity scoring and representation alignment.
5. 📊 Results and Evaluation: SF achieved state-of-the-art results on LIBERO and RoboTwin benchmarks, accelerated training by up to 3.8x, improved data efficiency, and demonstrated superior performance in both simulated and real-world robotic tasks.
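The alignment objective can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the linear map `proj` stands in for the paper's batch-normalization + MLP projector, and the toy values are made up.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two (N, D) matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def spatial_forcing_losses(vla_embeddings, geo_features, proj, action_loss, alpha=0.5):
    # Project the VLA's intermediate visual embeddings (a plain linear map here,
    # standing in for the normalization + MLP projector), then score them
    # against the frozen 3D foundation-model features with cosine similarity.
    aligned = vla_embeddings @ proj
    align_loss = -np.mean(cosine_sim(aligned, geo_features))   # L_align
    return action_loss + alpha * align_loss, align_loss        # L_SF, L_align

# Toy check: when projected embeddings already match the 3D features,
# every similarity is 1 and L_align reaches its minimum of -1.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
total, align = spatial_forcing_losses(x, x, np.eye(2), action_loss=1.0, alpha=0.5)
```

With `alpha=0.5` and a hypothetical action loss of 1.0, the combined loss is 1.0 + 0.5·(−1) = 0.5, showing how the alignment term simply adds a weighted penalty to the action objective.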


Spatial Forcing (SF) methodology flow:
- Problem analysis: a depth-probing experiment shows that VLA intermediate embeddings lack spatial understanding.
- 3D foundation model (VGGT): processes multi-view images I into spatial representations f^3D(I).
- VLA processing: vision tokens and language tokens yield intermediate visual embeddings x^V_i.
- Alignment process: batch normalization Γ and an MLP projector, followed by cosine-similarity alignment.
- Alignment loss: L_align = −(1/N) Σ_i S[ MLP·Γ(x^V_i), f^3D_i(I) + E ], where S[·,·] is cosine similarity.
- Combined training loss: L_SF = L_action + α·L_align.

Reported gains:
- Performance: 98.5% success rate on LIBERO, SOTA on RoboTwin, +47.5% in real-world tasks.
- Training efficiency: 3.8× faster convergence at the same success rates.
- Data efficiency: 5.9× more efficient; 75.8% success rate with only 5% of the data.
- Inference: no additional computational overhead.

Key innovation: implicit spatial comprehension without explicit 3D inputs or depth estimators, achieved by aligning intermediate VLA embeddings with representations from a pretrained 3D foundation model.
Q1
1. What is the main limitation of existing 3D VLA approaches that Spatial Forcing aims to overcome?
High computational cost of processing 3D data
✅ Dependence on unreliable depth sensors and incomplete datasets
Inability to handle multiple camera views
Q2
2. How does Spatial Forcing achieve spatial awareness in VLA models?
By adding extra 3D sensors to the robot
By training a separate depth estimation network
✅ By aligning visual embeddings with pretrained 3D foundation model representations
Q3
3. What significant performance improvement did Spatial Forcing demonstrate in training efficiency?
✅ Reduced training time by 3.8x
Improved accuracy by 3.8%
Reduced memory usage by 3.8x

Paper 2

Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training

Published: 2025-10-14

Link: http://arxiv.org/pdf/2510.12586

1. 📘 Topic and Domain: Pixel-space generative modeling for image synthesis using diffusion and consistency models.
2. 💡 Previous Research and New Ideas: Building on self-supervised learning approaches and prior diffusion models, the paper proposes a novel two-stage training framework that uses self-supervised pre-training of the encoder instead of relying on VAEs.
3. ❓ Problem: Addressing the persistent performance and efficiency gap between pixel-space generative models and their latent-space counterparts.
4. 🛠️ Methods: Uses a two-stage approach: pre-training encoders to capture semantics from clean images while aligning them along deterministic sampling trajectories, then fine-tuning with a randomly initialized decoder end-to-end.
5. 📊 Results and Evaluation: Achieved FID scores of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 NFEs for diffusion models, and 8.82 FID for one-step consistency model generation on ImageNet-256, surpassing previous pixel-space methods.
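The stage-1 contrastive objective between two augmented views can be illustrated with a generic InfoNCE-style loss. This is a hedged sketch of the standard formulation, not the paper's exact equation; the temperature value is a placeholder.

```python
import numpy as np

def info_nce(z1, z2, tau=0.2):
    """Generic InfoNCE loss: row i of z1 should match row i of z2
    against all other rows in the batch (identity targets)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                             # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))           # cross-entropy on diagonal

# Matched view pairs give a near-zero loss; mismatched pairings are penalized.
low = info_nce(np.eye(3), np.eye(3))
high = info_nce(np.eye(3), np.roll(np.eye(3), 1, axis=0))
```

In the paper's setting the two inputs would come from the online and momentum encoder branches; here they are plain arrays to keep the sketch self-contained.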


EPG: End-to-End Pixel Space Generative Modeling Framework

Stage 1 — Self-supervised pre-training:
- Clean images pass through data augmentation into views (y₁, y₂); noise produces trajectory points (x_{t_n}, x_{t_{n−1}}).
- Three encoder branches: online encoder E_θ, momentum encoder E_{θ⁻}, and stop-gradient encoder E_{sg(θ)}, with projectors L_θ and L_{θ⁻}.
- Objectives: a contrastive loss plus a representation-consistency loss.

Stage 2 — End-to-end fine-tuning:
- The pre-trained encoder E_θ and a randomly initialized decoder D_θ form the complete model f_θ.
- Trained with the diffusion objective (Equation 1) or the consistency objective (Equations 5 and 9).

Key technical components:
- Representation learning: visual semantics from both clean and noisy images.
- Temperature scheduling: τ(t) = τ₁·(1−t) + τ₂·t, for stable training.
- ODE trajectory alignment: points on the same sampling trajectory maintain consistent representations.
- Auxiliary contrastive loss (Equation 9) for consistency models.

Achievements:
- Diffusion model: 2.04 FID on ImageNet-256 at 75 NFE, SOTA among pixel-space methods.
- Consistency model: 8.82 FID on ImageNet-256 with single-step generation.
- High resolution: 2.35 FID on ImageNet-512 with efficient scaling.
- Training efficiency: competitive with VAE-based pipelines, with no external models.
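Two pieces of the framework are easy to sketch: the linear temperature schedule and the idea that encoder features of two points on the same deterministic sampling trajectory should agree. The endpoint temperatures below are illustrative placeholders, and the consistency term is a plain cosine distance, not the paper's exact loss.

```python
import numpy as np

def tau_schedule(t, tau1=0.07, tau2=0.2):
    """Linear temperature schedule from the framework: tau(t) = tau1*(1-t) + tau2*t.
    tau1/tau2 endpoints here are illustrative, not the paper's values."""
    return tau1 * (1.0 - t) + tau2 * t

def trajectory_consistency(h_tn, h_tn_minus_1):
    """Sketch of trajectory alignment: features of two points on the same
    ODE sampling trajectory should agree (mean cosine distance)."""
    a = h_tn / np.linalg.norm(h_tn, axis=1, keepdims=True)
    b = h_tn_minus_1 / np.linalg.norm(h_tn_minus_1, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(a * b, axis=1)))

# Identical features along the trajectory incur zero consistency penalty.
traj = trajectory_consistency(np.array([[1.0, 0.0], [0.0, 1.0]]),
                              np.array([[1.0, 0.0], [0.0, 1.0]]))
```

The schedule interpolates the temperature from τ₁ at t = 0 to τ₂ at t = 1, which the figure credits with stabilizing training.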
Q1
1. What is the main innovation in the paper's training framework compared to traditional approaches?
Using VAEs for latent space compression
✅ Two-stage training with self-supervised pre-training of encoders
Single-stage end-to-end training with larger models
Q2
2. What is the most impressive achievement of the paper's consistency model variant?
Achieving 2.04 FID score on ImageNet-256
✅ Successfully training without VAEs or pre-trained diffusion models
Using only 32 sampling steps for generation
Q3
3. How does the paper's approach improve the training efficiency?
By using more powerful GPUs and larger batch sizes
By compressing images into latent space representations
✅ By decomposing training into semantic learning and pixel generation stages

Paper 3

Scaling Language-Centric Omnimodal Representation Learning

Published: 2025-10-13

Link: http://arxiv.org/pdf/2510.11693

1. 📘 Topic and Domain: Language-centric omnimodal representation learning in multimodal large language models (MLLMs), focusing on cross-modal alignment and embedding capabilities.
2. 💡 Previous Research and New Ideas: Building on previous CLIP-style and MLLM-based embedding approaches, the paper proposes that MLLMs already achieve implicit cross-modal alignment during generative pretraining, so a lightweight contrastive-learning refinement is sufficient.
3. ❓ Problem: Understanding why MLLM-based embedding approaches outperform traditional CLIP-based models and developing more efficient methods for cross-modal representation learning.
4. 🛠️ Methods: Developed LCO-EMB framework using language-centric paired data for contrastive learning refinement, analyzed through anisotropy and kernel similarity studies, and validated on various benchmarks.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance across diverse modalities and benchmarks, discovered a Generation-Representation Scaling Law showing representation capabilities scale with generative abilities, and validated findings on a challenging visual-document retrieval task.
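The kernel-similarity analysis can be illustrated with a simple probe: build a cosine-similarity kernel over the same items in each modality's embedding space and correlate the two kernels. High correlation suggests the modalities induce similar relational structure, i.e. implicit cross-modal alignment. This is a generic sketch of the idea, not the paper's exact procedure.

```python
import numpy as np

def kernel_similarity(emb_a, emb_b):
    """Correlate the cosine-similarity kernels of two embedding spaces
    computed over the same items (a rough cross-modal alignment probe)."""
    def cos_kernel(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T
    ka, kb = cos_kernel(emb_a), cos_kernel(emb_b)
    iu = np.triu_indices_from(ka, k=1)   # off-diagonal upper triangle only
    return float(np.corrcoef(ka[iu], kb[iu])[0, 1])

# The cosine kernel is invariant to orthogonal transforms, so a space and a
# column-permuted copy of it have perfectly correlated kernels.
x = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
same = kernel_similarity(x, x)
rotated = kernel_similarity(x, x[:, [2, 0, 1]])
```

A text space and an image space with `kernel_similarity` near 1 would relate their items in nearly the same way even if the raw embedding coordinates differ.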


LCO-EMB: Language-Centric Omnimodal Representation Learning

Analysis phase: anisotropy analysis and kernel-similarity studies probe the cross-modal alignment already latent in the MLLM backbones (LLaVA-Next / Qwen2.5-VL / Qwen2.5-Omni with pretrained alignment).

Training pipeline:
- Text-only contrastive learning with LoRA fine-tuning on NLI / Scale-1M (minimal perturbation of the backbone).
- Optional enhancement: multimodal refinement on 94k synthetic pairs for task-space calibration.

Training data sources: MNLI + SNLI (276k), Scale-1M (1M pairs), visual documents, retrieval & compositionality data, multilingual data, synthetic samples.

Generation-Representation Scaling Law: generative quality ∝ representation performance, with the PAC-Bayesian bound L_pop ≤ log(N) − I_P(X;Y) + ε_P as theoretical justification.

Architecture: frozen vision/audio encoders → frozen projector/alignment layer → LoRA-tuned language decoder → unified embedding space with similarity matching.

Evaluation benchmarks: MIEB-Lite (51 tasks); audio-text (AudioCaps, Clotho); video-text (MSR-VTT, ActivityNet); SeaDoc (cross-lingual document retrieval).

Key results and achievements:
- State-of-the-art on MIEB-Lite: 68.8% (LCO-EMB-Omni 7B).
- 21× less training data than competing methods.
- Text-only variants outperform advanced baselines.
- Discovers latent cross-modal alignment in MLLMs and establishes the Generation-Representation Scaling Law.
- Generalizes across vision, audio, and video modalities.

Theoretical foundation: generative bottleneck log(N) − I_P(X;Y); optimization inefficiency ε_P (minimized by a strong prior); fine-tuning cost √((KL(Q‖P) + log(1/δ)) / 2n); LoRA is justified because it keeps KL(Q‖P) small.

Key innovations: language-centric training paradigm; minimal multimodal data requirement; preservation of generative capabilities; cross-modal generalization.

Empirical validation: multiple MLLM backbones tested; comprehensive benchmark evaluation; SeaDoc challenge task; ablation studies on LoRA settings.

Impact and applications: efficient multimodal representation, low-resource language support, advances in document understanding, and a scalable training paradigm.
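The anisotropy analysis mentioned above can likewise be sketched as mean pairwise cosine similarity, a common anisotropy probe (illustrative, and not necessarily the paper's exact metric): values near 1 mean the embeddings collapse into a narrow cone, values near 0 mean directions are well spread out.

```python
import numpy as np

def anisotropy(embeddings):
    """Mean pairwise cosine similarity over an embedding set, excluding
    each vector's similarity with itself."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    off_diag = sims[~np.eye(n, dtype=bool)]   # drop the diagonal of ones
    return float(off_diag.mean())

# Fully collapsed embeddings score 1.0; mutually orthogonal ones score 0.0.
collapsed = anisotropy(np.ones((3, 4)))
spread = anisotropy(np.eye(3))
```

Applied per modality, such a probe can show whether contrastive refinement reduces the anisotropy of an MLLM's raw embedding space.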
Q1
1. What is the key insight about MLLMs that the paper discovers?
MLLMs require extensive contrastive learning to achieve cross-modal alignment
✅ MLLMs achieve implicit cross-modal alignment during generative pretraining
MLLMs cannot perform as well as CLIP-based models in embedding tasks
Q2
2. What novel scaling law does the paper identify?
Model size directly correlates with embedding quality
Training data size determines representation capabilities
✅ Representational capabilities scale positively with generative abilities
Q3
3. Why does the paper's LCO-EMB framework use LoRA for fine-tuning?
To reduce computational costs during training
✅ To preserve the latent cross-modal alignment while enhancing representation capability
To enable training on larger batch sizes