2025-10-15 Papers


Paper 1

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

Published: 2025-10-14

Link: http://arxiv.org/pdf/2510.12276

1. 📘 Topic and Domain: The paper focuses on improving spatial awareness in Vision-Language-Action (VLA) models for robotic manipulation through implicit spatial representation alignment.
2. 💡 Previous Research and New Ideas: Whereas previous VLA models rely on explicit 3D sensor inputs or depth estimators, this paper proposes letting the model develop spatial comprehension implicitly, without relying on explicit 3D data.
3. ❓ Problem: The paper aims to solve the challenge of enabling VLA models to develop accurate spatial awareness without depending on explicit 3D sensor information or depth estimators, which are often unreliable or unavailable.
4. 🛠️ Methods: The authors introduce Spatial Forcing (SF), which aligns intermediate visual embeddings of VLAs with geometric representations from pretrained 3D foundation models through cosine similarity scoring and representation alignment.
5. 📊 Results and Evaluation: SF achieved state-of-the-art results on LIBERO and RoboTwin benchmarks, accelerated training by up to 3.8x, improved data efficiency, and demonstrated superior performance in both simulated and real-world robotic tasks.
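The alignment objective can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the linear map `proj` stands in for the paper's batch-normalization + MLP projector, and the toy values are made up.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two (N, D) matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def spatial_forcing_losses(vla_embeddings, geo_features, proj, action_loss, alpha=0.5):
    # Project the VLA's intermediate visual embeddings (a plain linear map here,
    # standing in for the normalization + MLP projector), then score them
    # against the frozen 3D foundation-model features with cosine similarity.
    aligned = vla_embeddings @ proj
    align_loss = -np.mean(cosine_sim(aligned, geo_features))   # L_align
    return action_loss + alpha * align_loss, align_loss        # L_SF, L_align

# Toy check: when projected embeddings already match the 3D features,
# every similarity is 1 and L_align reaches its minimum of -1.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
total, align = spatial_forcing_losses(x, x, np.eye(2), action_loss=1.0, alpha=0.5)
```

With `alpha=0.5` and a hypothetical action loss of 1.0, the combined loss is 1.0 + 0.5·(−1) = 0.5, showing how the alignment term simply adds a weighted penalty to the action objective.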


Spatial Forcing (SF) methodology flow:
- Problem analysis: a depth-probing experiment shows that VLA intermediate embeddings lack spatial understanding.
- 3D foundation model (VGGT): processes multi-view images I into spatial representations f^3D(I).
- VLA processing: vision tokens and language tokens yield intermediate visual embeddings x^V_i.
- Alignment process: batch normalization Γ and an MLP projector, followed by cosine-similarity alignment.
- Alignment loss: L_align = −(1/N) Σ_i S[ MLP·Γ(x^V_i), f^3D_i(I) + E ], where S[·,·] is cosine similarity.
- Combined training loss: L_SF = L_action + α·L_align.

Reported gains:
- Performance: 98.5% success rate on LIBERO, SOTA on RoboTwin, +47.5% in real-world tasks.
- Training efficiency: 3.8× faster convergence at the same success rates.
- Data efficiency: 5.9× more efficient; 75.8% success rate with only 5% of the data.
- Inference: no additional computational overhead.

Key innovation: implicit spatial comprehension without explicit 3D inputs or depth estimators, achieved by aligning intermediate VLA embeddings with representations from a pretrained 3D foundation model.
Q1
1. What is the main limitation of existing 3D VLA approaches that Spatial Forcing aims to overcome?
High computational cost of processing 3D data
✅ Dependence on unreliable depth sensors and incomplete datasets
Inability to handle multiple camera views
Q2
2. How does Spatial Forcing achieve spatial awareness in VLA models?
By adding extra 3D sensors to the robot
By training a separate depth estimation network
✅ By aligning visual embeddings with pretrained 3D foundation model representations
Q3
3. What significant performance improvement did Spatial Forcing demonstrate in training efficiency?
✅ Reduced training time by 3.8x
Improved accuracy by 3.8%
Reduced memory usage by 3.8x

Paper 2

Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training

Published: 2025-10-14

Link: http://arxiv.org/pdf/2510.12586

1. 📘 Topic and Domain: Pixel-space generative modeling for image synthesis using diffusion and consistency models.
2. 💡 Previous Research and New Ideas: Building on self-supervised learning approaches and prior diffusion models, the paper proposes a novel two-stage training framework that uses self-supervised pre-training of the encoder instead of relying on VAEs.
3. ❓ Problem: Addressing the persistent performance and efficiency gap between pixel-space generative models and their latent-space counterparts.
4. 🛠️ Methods: Uses a two-stage approach: pre-training encoders to capture semantics from clean images while aligning them along deterministic sampling trajectories, then fine-tuning with a randomly initialized decoder end-to-end.
5. 📊 Results and Evaluation: Achieved FID scores of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 NFEs for diffusion models, and 8.82 FID for one-step consistency model generation on ImageNet-256, surpassing previous pixel-space methods.
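The stage-1 contrastive objective between two augmented views can be illustrated with a generic InfoNCE-style loss. This is a hedged sketch of the standard formulation, not the paper's exact equation; the temperature value is a placeholder.

```python
import numpy as np

def info_nce(z1, z2, tau=0.2):
    """Generic InfoNCE loss: row i of z1 should match row i of z2
    against all other rows in the batch (identity targets)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                             # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))           # cross-entropy on diagonal

# Matched view pairs give a near-zero loss; mismatched pairings are penalized.
low = info_nce(np.eye(3), np.eye(3))
high = info_nce(np.eye(3), np.roll(np.eye(3), 1, axis=0))
```

In the paper's setting the two inputs would come from the online and momentum encoder branches; here they are plain arrays to keep the sketch self-contained.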


EPG: End-to-End Pixel Space Generative Modeling Framework

Stage 1 — Self-supervised pre-training:
- Clean images pass through data augmentation into views (y₁, y₂); noise produces trajectory points (x_{t_n}, x_{t_{n−1}}).
- Three encoder branches: online encoder E_θ, momentum encoder E_{θ⁻}, and stop-gradient encoder E_{sg(θ)}, with projectors L_θ and L_{θ⁻}.
- Objectives: a contrastive loss plus a representation-consistency loss.

Stage 2 — End-to-end fine-tuning:
- The pre-trained encoder E_θ and a randomly initialized decoder D_θ form the complete model f_θ.
- Trained with the diffusion objective (Equation 1) or the consistency objective (Equations 5 and 9).

Key technical components:
- Representation learning: visual semantics from both clean and noisy images.
- Temperature scheduling: τ(t) = τ₁·(1−t) + τ₂·t, for stable training.
- ODE trajectory alignment: points on the same sampling trajectory maintain consistent representations.
- Auxiliary contrastive loss (Equation 9) for consistency models.

Achievements:
- Diffusion model: 2.04 FID on ImageNet-256 at 75 NFE, SOTA among pixel-space methods.
- Consistency model: 8.82 FID on ImageNet-256 with single-step generation.
- High resolution: 2.35 FID on ImageNet-512 with efficient scaling.
- Training efficiency: competitive with VAE-based pipelines, with no external models.
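Two pieces of the framework are easy to sketch: the linear temperature schedule and the idea that encoder features of two points on the same deterministic sampling trajectory should agree. The endpoint temperatures below are illustrative placeholders, and the consistency term is a plain cosine distance, not the paper's exact loss.

```python
import numpy as np

def tau_schedule(t, tau1=0.07, tau2=0.2):
    """Linear temperature schedule from the framework: tau(t) = tau1*(1-t) + tau2*t.
    tau1/tau2 endpoints here are illustrative, not the paper's values."""
    return tau1 * (1.0 - t) + tau2 * t

def trajectory_consistency(h_tn, h_tn_minus_1):
    """Sketch of trajectory alignment: features of two points on the same
    ODE sampling trajectory should agree (mean cosine distance)."""
    a = h_tn / np.linalg.norm(h_tn, axis=1, keepdims=True)
    b = h_tn_minus_1 / np.linalg.norm(h_tn_minus_1, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(a * b, axis=1)))

# Identical features along the trajectory incur zero consistency penalty.
traj = trajectory_consistency(np.array([[1.0, 0.0], [0.0, 1.0]]),
                              np.array([[1.0, 0.0], [0.0, 1.0]]))
```

The schedule interpolates the temperature from τ₁ at t = 0 to τ₂ at t = 1, which the figure credits with stabilizing training.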
Q1
1. What is the main innovation in the paper's training framework compared to traditional approaches?
Using VAEs for latent space compression
✅ Two-stage training with self-supervised pre-training of encoders
Single-stage end-to-end training with larger models
Q2
2. What is the most impressive achievement of the paper's consistency model variant?
Achieving 2.04 FID score on ImageNet-256
✅ Successfully training without VAEs or pre-trained diffusion models
Using only 32 sampling steps for generation
Q3
3. How does the paper's approach improve the training efficiency?
By using more powerful GPUs and larger batch sizes
By compressing images into latent space representations
✅ By decomposing training into semantic learning and pixel generation stages

Paper 3

Scaling Language-Centric Omnimodal Representation Learning

Published: 2025-10-13

Link: http://arxiv.org/pdf/2510.11693

1. 📘 Topic and Domain: Language-centric omnimodal representation learning in multimodal large language models (MLLMs), focusing on cross-modal alignment and embedding capabilities.
2. 💡 Previous Research and New Ideas: Building on previous CLIP-style and MLLM-based embedding approaches, the paper proposes that MLLMs already achieve implicit cross-modal alignment during generative pretraining, so a lightweight contrastive-learning refinement is sufficient.
3. ❓ Problem: Understanding why MLLM-based embedding approaches outperform traditional CLIP-based models and developing more efficient methods for cross-modal representation learning.
4. 🛠️ Methods: Developed LCO-EMB framework using language-centric paired data for contrastive learning refinement, analyzed through anisotropy and kernel similarity studies, and validated on various benchmarks.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance across diverse modalities and benchmarks, discovered a Generation-Representation Scaling Law showing representation capabilities scale with generative abilities, and validated findings on a challenging visual-document retrieval task.
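The kernel-similarity analysis can be illustrated with a simple probe: build a cosine-similarity kernel over the same items in each modality's embedding space and correlate the two kernels. High correlation suggests the modalities induce similar relational structure, i.e. implicit cross-modal alignment. This is a generic sketch of the idea, not the paper's exact procedure.

```python
import numpy as np

def kernel_similarity(emb_a, emb_b):
    """Correlate the cosine-similarity kernels of two embedding spaces
    computed over the same items (a rough cross-modal alignment probe)."""
    def cos_kernel(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T
    ka, kb = cos_kernel(emb_a), cos_kernel(emb_b)
    iu = np.triu_indices_from(ka, k=1)   # off-diagonal upper triangle only
    return float(np.corrcoef(ka[iu], kb[iu])[0, 1])

# The cosine kernel is invariant to orthogonal transforms, so a space and a
# column-permuted copy of it have perfectly correlated kernels.
x = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
same = kernel_similarity(x, x)
rotated = kernel_similarity(x, x[:, [2, 0, 1]])
```

A text space and an image space with `kernel_similarity` near 1 would relate their items in nearly the same way even if the raw embedding coordinates differ.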


LCO-EMB: Language-Centric Omnimodal Representation Learning

Analysis phase: anisotropy analysis and kernel-similarity studies probe the cross-modal alignment already latent in the MLLM backbones (LLaVA-Next / Qwen2.5-VL / Qwen2.5-Omni with pretrained alignment).

Training pipeline:
- Text-only contrastive learning with LoRA fine-tuning on NLI / Scale-1M (minimal perturbation of the backbone).
- Optional enhancement: multimodal refinement on 94k synthetic pairs for task-space calibration.

Training data sources: MNLI + SNLI (276k), Scale-1M (1M pairs), visual documents, retrieval & compositionality data, multilingual data, synthetic samples.

Generation-Representation Scaling Law: generative quality ∝ representation performance, with the PAC-Bayesian bound L_pop ≤ log(N) − I_P(X;Y) + ε_P as theoretical justification.

Architecture: frozen vision/audio encoders → frozen projector/alignment layer → LoRA-tuned language decoder → unified embedding space with similarity matching.

Evaluation benchmarks: MIEB-Lite (51 tasks); audio-text (AudioCaps, Clotho); video-text (MSR-VTT, ActivityNet); SeaDoc (cross-lingual document retrieval).

Key results and achievements:
- State-of-the-art on MIEB-Lite: 68.8% (LCO-EMB-Omni 7B).
- 21× less training data than competing methods.
- Text-only variants outperform advanced baselines.
- Discovers latent cross-modal alignment in MLLMs and establishes the Generation-Representation Scaling Law.
- Generalizes across vision, audio, and video modalities.

Theoretical foundation: generative bottleneck log(N) − I_P(X;Y); optimization inefficiency ε_P (minimized by a strong prior); fine-tuning cost √((KL(Q‖P) + log(1/δ)) / 2n); LoRA is justified because it keeps KL(Q‖P) small.

Key innovations: language-centric training paradigm; minimal multimodal data requirement; preservation of generative capabilities; cross-modal generalization.

Empirical validation: multiple MLLM backbones tested; comprehensive benchmark evaluation; SeaDoc challenge task; ablation studies on LoRA settings.

Impact and applications: efficient multimodal representation, low-resource language support, advances in document understanding, and a scalable training paradigm.
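The anisotropy analysis mentioned above can likewise be sketched as mean pairwise cosine similarity, a common anisotropy probe (illustrative, and not necessarily the paper's exact metric): values near 1 mean the embeddings collapse into a narrow cone, values near 0 mean directions are well spread out.

```python
import numpy as np

def anisotropy(embeddings):
    """Mean pairwise cosine similarity over an embedding set, excluding
    each vector's similarity with itself."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    off_diag = sims[~np.eye(n, dtype=bool)]   # drop the diagonal of ones
    return float(off_diag.mean())

# Fully collapsed embeddings score 1.0; mutually orthogonal ones score 0.0.
collapsed = anisotropy(np.ones((3, 4)))
spread = anisotropy(np.eye(3))
```

Applied per modality, such a probe can show whether contrastive refinement reduces the anisotropy of an MLLM's raw embedding space.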
Q1
1. What is the key insight about MLLMs that the paper discovers?
MLLMs require extensive contrastive learning to achieve cross-modal alignment
✅ MLLMs achieve implicit cross-modal alignment during generative pretraining
MLLMs cannot perform as well as CLIP-based models in embedding tasks
Q2
2. What novel scaling law does the paper identify?
Model size directly correlates with embedding quality
Training data size determines representation capabilities
✅ Representational capabilities scale positively with generative abilities
Q3
3. Why does the paper's LCO-EMB framework use LoRA for fine-tuning?
To reduce computational costs during training
✅ To preserve the latent cross-modal alignment while enhancing representation capability
To enable training on larger batch sizes