2025-10-28 Papers


Paper 1

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Published: 2025-10-27

Link: http://arxiv.org/pdf/2510.23607

1. 📘 Topic and Domain: A self-supervised learning framework called Concerto for joint 2D-3D representation learning in computer vision and 3D scene understanding.
2. 💡 Previous Research and New Ideas: Based on recent advances in 2D (DINOv2) and 3D (Sonata) self-supervised learning, introduces a novel approach combining intra-modal self-distillation with cross-modal joint embedding prediction.
3. ❓ Problem: Addresses the limitation that self-supervised representations learned independently from images and point clouds don't fully overlap, seeking to uncover superior spatial representations through multi-modal self-supervised learning.
4. 🛠️ Methods: Combines intra-modal self-distillation for point clouds with cross-modal joint embedding prediction from images to point clouds, using a Point Transformer V3 model pretrained on 40k point clouds and 300k images.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance across multiple benchmarks, including 80.7% mIoU on ScanNet semantic segmentation, and surpasses the best standalone 2D and 3D self-supervised models by 14.2% and 4.8%, respectively.
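The two training signals in item 4 can be sketched as a minimal loss computation. This is a hedged illustration, not the paper's implementation: the function names (`concerto_loss`, `cosine_embedding_loss`, `clustering_ce_loss`), the `lam` weighting, and the toy inputs are all assumptions; in the real pipeline the image features would be lifted onto points via camera parameters and the teacher would produce online cluster assignments.

```python
import numpy as np

def cosine_embedding_loss(point_feats, image_feats):
    """Cross-modal term: 1 - cosine similarity between each point's
    feature and the image feature projected onto that point."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    q = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * q, axis=1)))

def clustering_ce_loss(student_logits, teacher_probs):
    """Intra-modal self-distillation term: cross-entropy between the
    teacher's cluster assignments and the student's predictions."""
    log_p = student_logits - np.log(
        np.sum(np.exp(student_logits), axis=1, keepdims=True))
    return float(-np.mean(np.sum(teacher_probs * log_p, axis=1)))

def concerto_loss(point_feats, image_feats, student_logits,
                  teacher_probs, lam=1.0):
    """Joint objective: self-distillation plus weighted cross-modal loss."""
    return clustering_ce_loss(student_logits, teacher_probs) + \
        lam * cosine_embedding_loss(point_feats, image_feats)

rng = np.random.default_rng(0)
n, d, k = 8, 16, 4                       # points, feature dim, clusters
feats = rng.normal(size=(n, d))
probs = np.full((n, k), 1.0 / k)          # uniform teacher assignments
loss = concerto_loss(feats, feats, rng.normal(size=(n, k)), probs)
# when both modalities agree exactly, the cross-modal term vanishes
```

The design choice the sketch highlights: the cross-modal term pulls the two modality-specific embedding spaces toward a shared one, while the clustering term keeps the 3D branch's own structure from collapsing.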

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Workflow diagram (recovered text):
- Data sources: 40k point clouds, 300k images, plus 50k video-lifted point clouds and 200k video images
- Intra-modal self-distillation (3D point cloud branch): Point Transformer V3, teacher-student paradigm, online clustering objective (cross-entropy)
- Cross-modal joint embedding (2D-3D prediction branch): DINOv2 image encoder, camera parameters, cosine similarity loss
- Outcome: joint 2D-3D self-supervised learning forms a multisensory synergy from which superior spatial representations emerge
- Evaluation protocols: linear probing, decoder probing, full fine-tuning, parameter efficiency and data efficiency analyses
- Downstream tasks: semantic segmentation, instance segmentation, video perception (ScanNet, ScanNet200, ScanNet++, S3DIS)
- Language alignment (Concerto Interlude): linear projection into CLIP space for zero-shot segmentation
- Key results: 77.3% mIoU (linear probing), 80.7% mIoU (full fine-tuning), +4.8% over Sonata; SOTA performance, superior to 2D+3D feature concatenation
Q1
1. What is the main innovation of Concerto compared to previous self-supervised learning approaches?
It only focuses on 2D image representation learning
It combines intra-modal self-distillation with cross-modal joint embedding prediction
It exclusively works with 3D point cloud data
Q2
2. What inspired the development of Concerto's learning approach?
The way computers process binary data
The way robots navigate through space
The way humans learn abstract concepts through multisensory synergy
Q3
3. On the ScanNet semantic segmentation benchmark, by how much did Concerto outperform standalone SOTA 2D self-supervised models?
4.8%
14.2%
80.7%

Paper 2

ReCode: Unify Plan and Action for Universal Granularity Control

Published: 2025-10-27

Link: http://arxiv.org/pdf/2510.23564

1. 📘 Topic and Domain: The paper presents ReCode, a paradigm for Large Language Model (LLM) based agents that focuses on universal granularity control in decision-making through recursive code generation.
2. 💡 Previous Research and New Ideas: Based on previous LLM agent frameworks like ReAct and planner-based agents, it introduces a novel approach that unifies planning and action into a single code representation, treating high-level plans as abstract placeholder functions.
3. ❓ Problem: The paper addresses the limitation of current LLM-based agents that have rigid separation between high-level planning and low-level actions, preventing flexible decision granularity control across different task complexities.
4. 🛠️ Methods: ReCode uses recursive code generation where placeholder functions are progressively decomposed into finer-grained sub-functions until reaching primitive actions, implementing this through a unified variable namespace and error handling system.
5. 📊 Results and Evaluation: Across three environments (ALFWorld, ScienceWorld, WebShop), ReCode achieved significant improvements over baselines, with an average score gain of 20.9% at inference, and showed superior training data efficiency, using 3.7× less data while still outperforming the baselines.
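The recursive decomposition in item 4 can be illustrated with a toy recursion. This is a sketch under stated assumptions: `PRIMITIVES`, `EXPANSIONS`, and `recode` are hypothetical names, and a fixed lookup table stands in for the LLM policy that would generate each placeholder's body on the fly.

```python
# Primitive actions the environment can execute directly
# (taken from the example decision tree in the workflow diagram).
PRIMITIVES = {"go to", "take", "move"}

# Stand-in for the LLM policy: maps a placeholder function name to the
# finer-grained steps it decomposes into.
EXPANSIONS = {
    "solve": ["find_and_take", "put_in"],
    "find_and_take": ["go to", "take"],
    "put_in": ["go to", "move"],
}

def recode(step, trace, depth=0, max_depth=10):
    """Recursively expand placeholder functions until only primitives
    remain, appending each executed primitive to `trace`."""
    if depth > max_depth:                  # depth-control mechanism
        raise RuntimeError("max recursion depth exceeded")
    if step in PRIMITIVES:
        trace.append(step)                 # run(action) in the environment
        return
    for sub in EXPANSIONS[step]:           # LLM-generated sub-functions
        recode(sub, trace, depth + 1, max_depth)

trace = []
recode("solve", trace)
# trace == ["go to", "take", "go to", "move"]
```

The point of the sketch: because plans and actions share one code representation, granularity is controlled simply by how deep the recursion goes before bottoming out in primitives.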

ReCode: Unify Plan and Action for Universal Granularity Control

Workflow diagram (recovered text):
- Task initialization: rule-based text-to-code conversion into solve(instruction, observation)
- Policy generation: the LLM generates a code block π(current_node)
- Code execution: code units are processed sequentially; a primitive action is executed directly via run(action) in the environment, while a non-primitive triggers recursive expansion, ReCode(task, π, env, placeholder), generating finer-grained sub-functions
- Context management: unified variable namespace with hierarchical information flow
- Termination: once all primitives are executed, control returns to the parent level
- Example hierarchical decision tree: solve() → find_and_take() → run('go to'), run('take'); put_in() → run('go to'), run('move')
- Key features: unified plan-action representation, dynamic granularity control, recursive decomposition, multi-granularity training data, error handling and self-correction, depth control mechanism, superior performance, data efficiency, and cost reduction (78.9% vs. ReAct)
Q1
1. What is the key innovation of ReCode compared to previous LLM-based agent frameworks?
It uses more sophisticated language models for better performance
It unifies planning and action into a single recursive code representation
It introduces a new type of reinforcement learning algorithm
Q2
2. In the experimental results, how did ReCode demonstrate data efficiency compared to ReAct?
It required 3.7x less training data while achieving better performance
It used the same amount of data but trained 3.7x faster
It needed 3.7x more data to achieve comparable results
Q3
3. What happens when ReCode encounters a placeholder function during execution?
It skips the function and moves to the next step
It throws an error and restarts the process
It pauses execution and recursively generates the necessary implementation code

Paper 3

FARMER: Flow AutoRegressive Transformer over Pixels

Published: 2025-10-27

Link: http://arxiv.org/pdf/2510.23588

1. 📘 Topic and Domain: A novel generative AI framework called FARMER that combines normalizing flows with autoregressive transformers for high-quality image generation and likelihood estimation.
2. 💡 Previous Research and New Ideas: Based on previous work in normalizing flows (NF) and autoregressive (AR) models, introducing a new unified framework that leverages strengths of both approaches while addressing their individual limitations.
3. ❓ Problem: Addressing the challenges of modeling continuous high-dimensional image data directly in pixel space, particularly the issues of long sequences and high-dimensional spaces that make traditional AR modeling difficult.
4. 🛠️ Methods: Implements an invertible autoregressive flow to transform images into latent sequences, uses self-supervised dimension reduction to handle redundancy, applies one-step distillation for faster inference, and introduces a resampling-based classifier-free guidance algorithm.
5. 📊 Results and Evaluation: Achieves competitive performance on ImageNet 256×256 generation with FID scores of 3.60-5.40, a significant improvement over previous pixel-space methods such as JetFormer, while one-step distillation makes the flow's reverse process 22× faster (a 4× overall inference speedup).
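The core likelihood computation in item 4 can be sketched via the change-of-variables formula with a per-dimension Gaussian-mixture prior. This is a simplified illustration, not FARMER itself: it omits the autoregressive conditioning and the informative/redundant dimension split, and the function names and toy affine flow are assumptions. Note that with K=1 mixture component the prior collapses to a single Gaussian, i.e. a plain flow model.

```python
import numpy as np

def gmm_log_prob(z, weights, means, stds):
    """Log-likelihood of a scalar z under a 1-D Gaussian mixture."""
    comp = -0.5 * ((z - means) / stds) ** 2 \
        - np.log(stds * np.sqrt(2.0 * np.pi))
    return float(np.log(np.sum(weights * np.exp(comp))))

def flow_log_likelihood(x, flow_forward, gmm_params):
    """Change of variables:
    log p(x) = sum_i log p_GMM(z_i) + log|det dZ/dX|."""
    z, log_det = flow_forward(x)
    return sum(gmm_log_prob(zi, *gmm_params) for zi in z) + log_det

def affine_flow(x):
    """Toy invertible map z = 2x + 1; log|det| = d * log 2."""
    return 2.0 * x + 1.0, x.size * np.log(2.0)

# Illustrative 2-component mixture (weights, means, stds).
params = (np.array([0.6, 0.4]), np.array([0.0, 1.0]), np.array([1.0, 1.0]))
x = np.zeros(3)
ll = flow_log_likelihood(x, affine_flow, params)
```

In the actual model the mixture parameters for each latent token are predicted autoregressively by the transformer rather than fixed, which is what couples the flow's exact likelihood with autoregressive expressivity.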

FARMER: Flow AutoRegressive Transformer over Pixels

Workflow diagram (recovered text):
- Input: raw image (H×W×C), dequantized and patchified
- Autoregressive flow (AF): Z = F(X) = f_N ∘ f_{N−1} ∘ … ∘ f_1(X), with token permutation between AF blocks
- Dimension split: Z^I (informative channels) vs. Z^R (redundant channels)
- Autoregressive transformer (conditioned on class label c): a per-token GMM head models Z^I, a shared GMM models Z^R
- Training objective (AF + AR jointly trained end to end): L = −1/(N·d) [Σ_i log p(z^I_i | z^I_{<i}, c) + Σ_i log p(z^R_i | Z^I, c) + log det(∂Z/∂X)]
- Enhancement techniques: one-step distillation (22× AF reverse speedup) and resampling-based classifier-free guidance, yielding a 4× inference speedup and enhanced generation quality
- Outputs: high-quality synthesis with exact likelihood estimation, pixel-level detail, and controllable generation
- Key innovation: a unified NF + AR framework in which self-supervised dimension reduction separates informative from redundant channels, images are modeled directly in pixel space without VAE compression, invertible transformations preserve information, and tractable likelihood meets autoregressive expressivity in a single end-to-end framework
- Key results: ImageNet 256×256 FID 3.60, IS 269.21; FID reduction of 3.04 vs. JetFormer; competitive with latent-space models
Q1
1. What is the main innovation that FARMER introduces to handle high-dimensional image data?
A new type of GAN architecture with multiple discriminators
A self-supervised dimension reduction scheme that separates informative and redundant channels
A purely autoregressive model without any flow components
Q2
2. How much speed improvement did the one-step distillation technique achieve for the NF reverse process?
4x faster
12x faster
22x faster
Q3
3. When the number of Gaussian Mixture Model (GMM) components K is set to 1 in FARMER, what does the model reduce to?
A standard Variational Autoencoder
A single Autoregressive Flow
A basic Transformer model