2025-10-28 Papers


Paper 1

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Published: 2025-10-27

Link: http://arxiv.org/pdf/2510.23607

1. 📘 Topic and Domain: A self-supervised learning framework called Concerto for joint 2D-3D representation learning in computer vision and 3D scene understanding.
2. 💡 Previous Research and New Ideas: Based on recent advances in 2D (DINOv2) and 3D (Sonata) self-supervised learning, introduces a novel approach combining intra-modal self-distillation with cross-modal joint embedding prediction.
3. ❓ Problem: Addresses the limitation that self-supervised representations learned independently from images and point clouds don't fully overlap, seeking to uncover superior spatial representations through multi-modal self-supervised learning.
4. 🛠️ Methods: Combines intra-modal self-distillation for point clouds with cross-modal joint embedding prediction from images to point clouds, using a Point Transformer V3 model pretrained on 40k point clouds and 300k images.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance across multiple benchmarks, including 80.7% mIoU on ScanNet semantic segmentation, and surpasses the best standalone 2D and 3D self-supervised models by 14.2% and 4.8%, respectively.
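The two training signals in item 4 can be sketched as a minimal loss computation. This is a hedged illustration, not the paper's implementation: the function names (`concerto_loss`, `cosine_embedding_loss`, `clustering_ce_loss`), the `lam` weighting, and the toy inputs are all assumptions; in the real pipeline the image features would be lifted onto points via camera parameters and the teacher would produce online cluster assignments.

```python
import numpy as np

def cosine_embedding_loss(point_feats, image_feats):
    """Cross-modal term: 1 - cosine similarity between each point's
    feature and the image feature projected onto that point."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    q = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * q, axis=1)))

def clustering_ce_loss(student_logits, teacher_probs):
    """Intra-modal self-distillation term: cross-entropy between the
    teacher's cluster assignments and the student's predictions."""
    log_p = student_logits - np.log(
        np.sum(np.exp(student_logits), axis=1, keepdims=True))
    return float(-np.mean(np.sum(teacher_probs * log_p, axis=1)))

def concerto_loss(point_feats, image_feats, student_logits,
                  teacher_probs, lam=1.0):
    """Joint objective: self-distillation plus weighted cross-modal loss."""
    return clustering_ce_loss(student_logits, teacher_probs) + \
        lam * cosine_embedding_loss(point_feats, image_feats)

rng = np.random.default_rng(0)
n, d, k = 8, 16, 4                       # points, feature dim, clusters
feats = rng.normal(size=(n, d))
probs = np.full((n, k), 1.0 / k)          # uniform teacher assignments
loss = concerto_loss(feats, feats, rng.normal(size=(n, k)), probs)
# when both modalities agree exactly, the cross-modal term vanishes
```

The design choice the sketch highlights: the cross-modal term pulls the two modality-specific embedding spaces toward a shared one, while the clustering term keeps the 3D branch's own structure from collapsing.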

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Workflow diagram (recovered text):
- Data sources: 40k point clouds, 300k images, plus 50k video-lifted point clouds and 200k video images
- Intra-modal self-distillation (3D point cloud branch): Point Transformer V3, teacher-student paradigm, online clustering objective (cross-entropy)
- Cross-modal joint embedding (2D-3D prediction branch): DINOv2 image encoder, camera parameters, cosine similarity loss
- Outcome: joint 2D-3D self-supervised learning forms a multisensory synergy from which superior spatial representations emerge
- Evaluation protocols: linear probing, decoder probing, full fine-tuning, parameter efficiency and data efficiency analyses
- Downstream tasks: semantic segmentation, instance segmentation, video perception (ScanNet, ScanNet200, ScanNet++, S3DIS)
- Language alignment (Concerto Interlude): linear projection into CLIP space for zero-shot segmentation
- Key results: 77.3% mIoU (linear probing), 80.7% mIoU (full fine-tuning), +4.8% over Sonata; SOTA performance, superior to 2D+3D feature concatenation
Q1
1. What is the main innovation of Concerto compared to previous self-supervised learning approaches?
It only focuses on 2D image representation learning
It combines intra-modal self-distillation with cross-modal joint embedding prediction
It exclusively works with 3D point cloud data
Q2
2. What inspired the development of Concerto's learning approach?
The way computers process binary data
The way robots navigate through space
The way humans learn abstract concepts through multisensory synergy
Q3
3. On the ScanNet semantic segmentation benchmark, by how much did Concerto outperform standalone SOTA 2D self-supervised models?
4.8%
14.2%
80.7%

Paper 2

ReCode: Unify Plan and Action for Universal Granularity Control

Published: 2025-10-27

Link: http://arxiv.org/pdf/2510.23564

1. 📘 Topic and Domain: The paper presents ReCode, a paradigm for Large Language Model (LLM) based agents that focuses on universal granularity control in decision-making through recursive code generation.
2. 💡 Previous Research and New Ideas: Based on previous LLM agent frameworks like ReAct and planner-based agents, it introduces a novel approach that unifies planning and action into a single code representation, treating high-level plans as abstract placeholder functions.
3. ❓ Problem: The paper addresses the limitation of current LLM-based agents that have rigid separation between high-level planning and low-level actions, preventing flexible decision granularity control across different task complexities.
4. 🛠️ Methods: ReCode uses recursive code generation where placeholder functions are progressively decomposed into finer-grained sub-functions until reaching primitive actions, implementing this through a unified variable namespace and error handling system.
5. 📊 Results and Evaluation: Across three environments (ALFWorld, ScienceWorld, WebShop), ReCode achieved significant improvements over baselines, with an average score gain of 20.9% at inference, and showed superior training data efficiency, using 3.7× less data while still outperforming the baselines.
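The recursive decomposition in item 4 can be illustrated with a toy recursion. This is a sketch under stated assumptions: `PRIMITIVES`, `EXPANSIONS`, and `recode` are hypothetical names, and a fixed lookup table stands in for the LLM policy that would generate each placeholder's body on the fly.

```python
# Primitive actions the environment can execute directly
# (taken from the example decision tree in the workflow diagram).
PRIMITIVES = {"go to", "take", "move"}

# Stand-in for the LLM policy: maps a placeholder function name to the
# finer-grained steps it decomposes into.
EXPANSIONS = {
    "solve": ["find_and_take", "put_in"],
    "find_and_take": ["go to", "take"],
    "put_in": ["go to", "move"],
}

def recode(step, trace, depth=0, max_depth=10):
    """Recursively expand placeholder functions until only primitives
    remain, appending each executed primitive to `trace`."""
    if depth > max_depth:                  # depth-control mechanism
        raise RuntimeError("max recursion depth exceeded")
    if step in PRIMITIVES:
        trace.append(step)                 # run(action) in the environment
        return
    for sub in EXPANSIONS[step]:           # LLM-generated sub-functions
        recode(sub, trace, depth + 1, max_depth)

trace = []
recode("solve", trace)
# trace == ["go to", "take", "go to", "move"]
```

The point of the sketch: because plans and actions share one code representation, granularity is controlled simply by how deep the recursion goes before bottoming out in primitives.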

ReCode: Unify Plan and Action for Universal Granularity Control

Workflow diagram (recovered text):
- Task initialization: rule-based text-to-code conversion into solve(instruction, observation)
- Policy generation: the LLM generates a code block π(current_node)
- Code execution: code units are processed sequentially; a primitive action is executed directly via run(action) in the environment, while a non-primitive triggers recursive expansion, ReCode(task, π, env, placeholder), generating finer-grained sub-functions
- Context management: unified variable namespace with hierarchical information flow
- Termination: once all primitives are executed, control returns to the parent level
- Example hierarchical decision tree: solve() → find_and_take() → run('go to'), run('take'); put_in() → run('go to'), run('move')
- Key features: unified plan-action representation, dynamic granularity control, recursive decomposition, multi-granularity training data, error handling and self-correction, depth control mechanism, superior performance, data efficiency, and cost reduction (78.9% vs. ReAct)
Q1
1. What is the key innovation of ReCode compared to previous LLM-based agent frameworks?
It uses more sophisticated language models for better performance
It unifies planning and action into a single recursive code representation
It introduces a new type of reinforcement learning algorithm
Q2
2. In the experimental results, how did ReCode demonstrate data efficiency compared to ReAct?
It required 3.7x less training data while achieving better performance
It used the same amount of data but trained 3.7x faster
It needed 3.7x more data to achieve comparable results
Q3
3. What happens when ReCode encounters a placeholder function during execution?
It skips the function and moves to the next step
It throws an error and restarts the process
It pauses execution and recursively generates the necessary implementation code

Paper 3

FARMER: Flow AutoRegressive Transformer over Pixels

Published: 2025-10-27

Link: http://arxiv.org/pdf/2510.23588

1. 📘 Topic and Domain: A novel generative AI framework called FARMER that combines normalizing flows with autoregressive transformers for high-quality image generation and likelihood estimation.
2. 💡 Previous Research and New Ideas: Based on previous work in normalizing flows (NF) and autoregressive (AR) models, introducing a new unified framework that leverages strengths of both approaches while addressing their individual limitations.
3. ❓ Problem: Addressing the challenges of modeling continuous high-dimensional image data directly in pixel space, particularly the issues of long sequences and high-dimensional spaces that make traditional AR modeling difficult.
4. 🛠️ Methods: Implements an invertible autoregressive flow to transform images into latent sequences, uses self-supervised dimension reduction to handle redundancy, applies one-step distillation for faster inference, and introduces a resampling-based classifier-free guidance algorithm.
5. 📊 Results and Evaluation: Achieves competitive performance on ImageNet 256×256 generation with FID scores of 3.60-5.40, a significant improvement over previous pixel-space methods such as JetFormer, while one-step distillation makes the flow's reverse process 22× faster (a 4× overall inference speedup).
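The core likelihood computation in item 4 can be sketched via the change-of-variables formula with a per-dimension Gaussian-mixture prior. This is a simplified illustration, not FARMER itself: it omits the autoregressive conditioning and the informative/redundant dimension split, and the function names and toy affine flow are assumptions. Note that with K=1 mixture component the prior collapses to a single Gaussian, i.e. a plain flow model.

```python
import numpy as np

def gmm_log_prob(z, weights, means, stds):
    """Log-likelihood of a scalar z under a 1-D Gaussian mixture."""
    comp = -0.5 * ((z - means) / stds) ** 2 \
        - np.log(stds * np.sqrt(2.0 * np.pi))
    return float(np.log(np.sum(weights * np.exp(comp))))

def flow_log_likelihood(x, flow_forward, gmm_params):
    """Change of variables:
    log p(x) = sum_i log p_GMM(z_i) + log|det dZ/dX|."""
    z, log_det = flow_forward(x)
    return sum(gmm_log_prob(zi, *gmm_params) for zi in z) + log_det

def affine_flow(x):
    """Toy invertible map z = 2x + 1; log|det| = d * log 2."""
    return 2.0 * x + 1.0, x.size * np.log(2.0)

# Illustrative 2-component mixture (weights, means, stds).
params = (np.array([0.6, 0.4]), np.array([0.0, 1.0]), np.array([1.0, 1.0]))
x = np.zeros(3)
ll = flow_log_likelihood(x, affine_flow, params)
```

In the actual model the mixture parameters for each latent token are predicted autoregressively by the transformer rather than fixed, which is what couples the flow's exact likelihood with autoregressive expressivity.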

FARMER: Flow AutoRegressive Transformer over Pixels

Workflow diagram (recovered text):
- Input: raw image (H×W×C), dequantized and patchified
- Autoregressive flow (AF): Z = F(X) = f_N ∘ f_{N−1} ∘ … ∘ f_1(X), with token permutation between AF blocks
- Dimension split: Z^I (informative channels) vs. Z^R (redundant channels)
- Autoregressive transformer (conditioned on class label c): a per-token GMM head models Z^I, a shared GMM models Z^R
- Training objective (AF + AR jointly trained end to end): L = −1/(N·d) [Σ_i log p(z^I_i | z^I_{<i}, c) + Σ_i log p(z^R_i | Z^I, c) + log det(∂Z/∂X)]
- Enhancement techniques: one-step distillation (22× AF reverse speedup) and resampling-based classifier-free guidance, yielding a 4× inference speedup and enhanced generation quality
- Outputs: high-quality synthesis with exact likelihood estimation, pixel-level detail, and controllable generation
- Key innovation: a unified NF + AR framework in which self-supervised dimension reduction separates informative from redundant channels, images are modeled directly in pixel space without VAE compression, invertible transformations preserve information, and tractable likelihood meets autoregressive expressivity in a single end-to-end framework
- Key results: ImageNet 256×256 FID 3.60, IS 269.21; FID reduction of 3.04 vs. JetFormer; competitive with latent-space models
Q1
1. What is the main innovation that FARMER introduces to handle high-dimensional image data?
A new type of GAN architecture with multiple discriminators
A self-supervised dimension reduction scheme that separates informative and redundant channels
A purely autoregressive model without any flow components
Q2
2. How much speed improvement did the one-step distillation technique achieve for the NF reverse process?
4x faster
12x faster
22x faster
Q3
3. When the number of Gaussian Mixture Model (GMM) components K is set to 1 in FARMER, what does the model reduce to?
A standard Variational Autoencoder
A single Autoregressive Flow
A basic Transformer model