2026-04-01 Papers


Paper 1

CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

Published: 2026-03-30

Link: http://arxiv.org/pdf/2603.28032

1. 📘 Topic and Domain: The paper presents CARLA-Air, a unified simulation infrastructure for air-ground embodied intelligence research that integrates aerial drone and ground vehicle simulation within a single physically coherent environment.
2. 💡 Previous Research and New Ideas: Based on CARLA (urban driving simulator) and AirSim (UAV simulator), the new idea is integrating both backends within a single Unreal Engine process to overcome the limitations of domain-segregated simulators and bridge-based co-simulation approaches.
3. ❓ Problem: The paper solves the problem that existing open-source simulators cannot jointly model aerial and ground agents with strict spatial-temporal consistency—driving simulators lack aerial dynamics, UAV simulators lack realistic ground scenes, and co-simulation introduces synchronization overhead.
4. 🛠️ Methods: The method uses a composition-based design that resolves the UE4 single-game-mode constraint by having CARLAAirGameMode inherit CARLA's ground simulation while composing AirSim's aerial flight actor as a regular world entity, preserving both native APIs within a shared physics tick and rendering pipeline.
5. 📊 Results and Evaluation: The platform achieves ~20 FPS under joint air-ground workloads, maintains stable operation over 3-hour endurance tests with zero crashes across 357 reset cycles, delivers <0.5ms API latency, and was validated through five representative workflows including precision landing, multi-modal dataset collection, and RL training.
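The composition-based design in point 4 can be pictured as a single clock driving both agent kinds. The toy loop below is my sketch (with a made-up `Agent` class), not CARLA-Air's code; it only illustrates why one shared physics tick yields strict spatial-temporal alignment: every agent is stepped under the same tick inside one process, so there is nothing to synchronize across process boundaries.

```python
# Illustrative sketch: a shared physics tick vs. bridge-based co-simulation.
# In a single-process simulator, ground and aerial agents advance under the
# same clock, so every observation pair is aligned to the same tick.

class Agent:
    """Toy stand-in for a simulated ground vehicle or aerial drone."""
    def __init__(self, name):
        self.name, self.tick = name, 0

    def step(self, tick):
        self.tick = tick  # a real simulator would run physics here

def shared_tick_loop(agents, n_ticks):
    """Advance all agents under one clock, as a single-process simulator does."""
    for tick in range(1, n_ticks + 1):
        for agent in agents:  # every agent sees the exact same tick
            agent.step(tick)
    return [a.tick for a in agents]

ground = Agent("vehicle")
aerial = Agent("drone")
print(shared_tick_loop([ground, aerial], 100))  # [100, 100] -- always aligned
```

A bridge-based setup, by contrast, runs two loops in separate processes and must exchange messages to keep their clocks close, which is where the 1-5 ms IPC latency and drift the paper criticizes come from.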


Overview (reconstructed from the paper's summary figure):

- Problem: domain-segregated simulators (CARLA for ground, AirSim for aerial) force bridge-based co-simulation, which adds synchronization overhead and communication latency and cannot guarantee spatial-temporal consistency.
- Architecture: a single UE4 process eliminates inter-process serialization. A ground plugin (CARLA-based urban traffic simulation) and an aerial plugin (AirSim-based multirotor dynamics) are joined by CARLAAirGameMode, which inherits the CARLA GameMode and composes the AirSim flight actor during the BeginPlay phase. Both share one physics tick and one rendering pipeline, yielding strict spatial-temporal consistency, and both native Python APIs remain available (CARLA for ground sensing and control, AirSim for aerial sensing and flight).
- Key capabilities: single process; API compatible; 18 sensor modalities; ROS 2 support; asset pipeline; extensible. Traffic flow is rule-compliant, pedestrians are socially aware, and UAV dynamics are aerodynamically consistent.
- Performance validation: ~20 FPS under joint workloads (≈19.8 FPS moderate); <0.5 ms data transfer vs. 1-5 ms for bridge IPC; 3-hour stability with 0 crashes, 0 memory leaks, and 357 reset cycles; 3,878 MiB VRAM (76% of budget left for GPU-based training).
- Representative workflows: W1 precision landing (air-ground cooperation, <0.5 m landing error); W2 VLN/VLA data (embodied navigation, vision-language grounding); W3 multi-modal dataset (12-stream synchronized capture, ≤1-tick alignment); W4/W5 cross-view perception and RL policy training (14/14 weather presets).
- Research directions supported: air-ground cooperation; embodied navigation and VLA; multi-modal perception; RL policy training.
- Coordinate system mapping: UE4 (left-handed, cm) ↔ NED (right-handed, m).
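The UE4 ↔ NED mapping noted at the end of the figure can be made concrete. The sketch below assumes the standard AirSim convention (UE4: left-handed, Z-up, centimeters; NED: right-handed, Z-down, meters) and ignores any origin offset relative to the player start, which the platform may additionally apply.

```python
# Sketch of the UE4 <-> NED frame mapping from the paper's overview figure.
# Assumption: the standard AirSim convention -- UE4 is left-handed, Z-up,
# in centimeters; NED (North-East-Down) is right-handed, Z-down, in meters.

def ue4_to_ned(x_cm, y_cm, z_cm):
    """Convert a UE4 world position (cm, Z up) to NED (m, Z down)."""
    return (x_cm / 100.0, y_cm / 100.0, -z_cm / 100.0)

def ned_to_ue4(north_m, east_m, down_m):
    """Inverse mapping: NED (m) back to UE4 world coordinates (cm)."""
    return (north_m * 100.0, east_m * 100.0, -down_m * 100.0)

# A drone hovering 50 m above the UE4 origin (z = +5000 cm)
# appears at down = -50 m in NED.
print(ue4_to_ned(1000.0, -200.0, 5000.0))  # (10.0, -2.0, -50.0)
```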
Q1. How does CARLA-Air resolve the UE4 single-game-mode constraint that prevents running both CARLA and AirSim simultaneously?
A) Through a composition-based design in which CARLAAirGameMode inherits the ground simulation while composing the aerial flight actor as a world entity
B) By using bridge-based co-simulation with ROS 2 message passing between separate processes
C) By splitting the simulation into multiple independent UE4 instances

Q2. What is the approximate frame rate achieved by CARLA-Air under moderate joint air-ground workloads?
A) Approximately 20 FPS
B) Approximately 40 FPS
C) Approximately 60 FPS

Q3. Which of the following represents the maximum number of sensor modalities that CARLA-Air can synchronously capture across aerial and ground platforms?
A) Up to 18 sensor modalities
B) Up to 8 sensor modalities
C) Up to 12 sensor modalities

Paper 2

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Published: 2026-03-29

Link: http://arxiv.org/pdf/2603.27538

1. 📘 Topic and Domain: The paper addresses native multimodal modeling, proposing a unified framework that represents text, vision, and audio within a shared discrete token space for autoregressive generation.
2. 💡 Previous Research and New Ideas: Based on the Next-Token Prediction (NTP) paradigm and prior multimodal systems that treat non-linguistic modalities as external attachments, the paper introduces the Discrete Native Autoregression (DiNA) paradigm and dNaViT (Discrete Native Resolution Vision Transformer) for unified multimodal understanding and generation.
3. ❓ Problem: The paper solves the challenge of representing non-linguistic modalities (vision and audio) within a discrete token space, overcoming the performance ceiling of discrete visual modeling and reconciling the traditionally conflicting objectives of understanding and generation.
4. 🛠️ Methods: The paper employs Semantic-and-Aligned Encoders (SAE) for semantic completeness, Residual Vector Quantization (RVQ) for hierarchical discrete tokenization, a modality-agnostic MoE backbone, and internal linguistic guidance for audio generation with parallel/serial strategies.
5. 📊 Results and Evaluation: LongCat-Next achieves competitive performance on visual understanding benchmarks (MMMU 70.6, MathVista 83.1), strong visual generation (GenEval 84.44), state-of-the-art audio capabilities, and maintains robust text capabilities, outperforming unified multimodal models while reconciling understanding and generation objectives.
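The RVQ step in point 4 works by quantizing residuals level by level, so deeper levels capture progressively finer detail. Below is a minimal numpy sketch using random toy codebooks rather than the paper's trained ones; each codebook includes a zero codeword so the residual norm can never increase from one level to the next.

```python
import numpy as np

# Minimal sketch of Residual Vector Quantization (RVQ): each level
# quantizes the residual left by the previous level. Codebooks here are
# random toys, not LongCat-Next's trained 8-level codebooks.

rng = np.random.default_rng(0)
dim, levels, codebook_size = 8, 4, 16
codebooks = [
    np.vstack([np.zeros((1, dim)), rng.normal(size=(codebook_size - 1, dim))])
    for _ in range(levels)
]

def rvq_encode(x, codebooks):
    """Return one code index per level by greedily quantizing residuals."""
    residual, codes = x.astype(float).copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords across levels."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
err_l1 = np.linalg.norm(x - codebooks[0][codes[0]])         # 1-level error
err_all = np.linalg.norm(x - rvq_decode(codes, codebooks))  # 4-level error
print(codes, err_all <= err_l1)  # extra levels never hurt: True
```

The hierarchical property is the point: a coarse code alone already approximates the input, and each additional level refines it, which is what lets a discrete representation stay useful for both understanding and generation.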


Overview (reconstructed from the paper's methodology flow chart, "DiNA: Discrete Native Autoregression Paradigm"):

- Input modalities: vision (continuous visual signal, any resolution); audio (continuous waveform, 12.5 Hz token rate); text (discrete subword tokens).
- Tokenizers:
  - dNaViT (Discrete Native-Resolution ViT): Semantic-and-Aligned Encoder (SAE) built on the Qwen2.5-ViT encoder; 8-level cascaded RVQ codebooks; any-resolution processing with up to 28× compression.
  - Audio tokenizer: Whisper encoder for semantic and paralinguistic features; 8-layer RVQ with codebook sizes 8k, 4k, 2k, 1k, 1k, 1k, 1k, 1k; flow-matching decoder for high-fidelity reconstruction.
  - Text tokenizer: standard byte-level BPE with special tokens (AS, AE, TE) for audio alignment; natively discrete and ready for autoregressive modeling.
- Unified discrete token space: vision, audio, and text tokens share one embedding space, with multi-level summation for vision and a text-guided audio modality using internal linguistic guidance.
- Backbone: LongCat-Flash-Lite, a modality-agnostic MoE (68.5B total / ~3B activated parameters) with zero-expert and shortcut MoE, a unified next-token-prediction (NTP) objective, a single modality-agnostic pathway, and a multi-level DepthTransformer providing an exponential representation space and efficient parallel decoding.
- Outputs: visual understanding (OCR, VQA, math, reasoning); visual generation (text-to-image at any resolution); audio processing (ASR, TTS, voice cloning).
- Training pipeline: Phase I tokenizer training; Phase II pre-align, pre-train, mid-train, then SFT.
- Key design principles: semantic completeness (preserve information for both understanding and generation); residual vector quantization (hierarchical discrete tokens, minimal information loss); native multimodality (one token space, one autoregressive objective).
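One common way to realize a "unified discrete token space" like the one in the flow chart is to offset each modality's codes into a single flat vocabulary. The sketch below uses hypothetical vocabulary sizes of my choosing and is a generic construction, not the paper's actual vocabulary layout.

```python
# Sketch of a shared discrete token space: per-modality codes are mapped
# into one flat vocabulary via modality-specific index offsets, so a single
# autoregressive model can emit any modality's tokens. Sizes are made up.

TEXT_VOCAB = 32_000   # hypothetical text vocabulary size
VISION_CODES = 8_192  # hypothetical vision codebook size
AUDIO_CODES = 8_192   # hypothetical audio codebook size

VISION_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VISION_CODES

def to_shared(modality, code):
    """Map a per-modality code to a unique id in the shared vocabulary."""
    offset = {"text": 0, "vision": VISION_OFFSET, "audio": AUDIO_OFFSET}[modality]
    return offset + code

def from_shared(token_id):
    """Recover (modality, code) from a shared-vocabulary id."""
    if token_id < VISION_OFFSET:
        return "text", token_id
    if token_id < AUDIO_OFFSET:
        return "vision", token_id - VISION_OFFSET
    return "audio", token_id - AUDIO_OFFSET

print(to_shared("vision", 5))              # 32005
print(from_shared(to_shared("audio", 7)))  # ('audio', 7)
```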
Q1. What paradigm does LongCat-Next introduce to represent all modalities within a shared discrete token space?
A) Discrete Native Autoregression (DiNA)
B) Continuous Multimodal Fusion
C) Sequential Modality Processing

Q2. Which component enables LongCat-Next to perform tokenization and de-tokenization at arbitrary resolutions for visual signals?
A) dNaViT (Discrete Native Resolution Vision Transformer)
B) Traditional CNN-based Encoder
C) Fixed-Resolution ViT Tokenizer

Q3. In the DiNA framework, what technique does the vision tokenizer use to preserve both high-level semantics and fine-grained visual details?
A) Residual Vector Quantization (RVQ)
B) Single-Level Vector Quantization
C) Continuous Feature Projection

Paper 3

GEMS: Agent-Native Multimodal Generation with Memory and Skills

Published: 2026-03-30

Link: http://arxiv.org/pdf/2603.28088

1. 📘 Topic and Domain: The paper focuses on multimodal text-to-image generation, specifically developing an agent-native framework called GEMS to enhance generation quality for complex instructions and specialized downstream tasks.
2. 💡 Previous Research and New Ideas: Based on advanced agent frameworks (Claude Code, OpenClaw) and inference-time scaling methods, the paper introduces a novel combination of Agent Loop (iterative closed-loop optimization), Agent Memory (hierarchical compression for trajectory-level persistence), and Agent Skill (on-demand domain-specific expertise loading).
3. ❓ Problem: Current multimodal generation models struggle with complex, multi-faceted instructions and specialized downstream applications, representing a "long-tail" challenge where general-purpose capabilities reach their limits.
4. 🛠️ Methods: GEMS employs three core components: Agent Loop uses Planner, Decomposer, Generator, Verifier, and Refiner modules for iterative refinement; Agent Memory uses hierarchical compression to store factual artifacts and distilled experiences; Agent Skill provides extensible domain-specific expertise with on-demand loading.
5. 📊 Results and Evaluation: GEMS enables the 6B Z-Image-Turbo model to surpass state-of-the-art Nano Banana 2 on GenEval2, achieving average performance gains of 14.22 on mainstream benchmarks and 14.03 on downstream tasks, while maintaining superior efficiency with fewer average images generated.
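The Agent Loop in point 4 can be sketched as a generate-verify-refine cycle with early stopping and best-image selection. All five modules below are toy stand-ins (simple lambdas over sets of words), not the paper's planner, generator, or MLLM verifier; only the control flow mirrors the described loop.

```python
# Illustrative sketch of the GEMS agent loop: Decomposer -> Generator ->
# Verifier -> Refiner, iterating until every atomic criterion is met or
# the iteration budget N_max runs out, then returning the best attempt.

def agent_loop(user_prompt, decompose, generate, verify, refine, n_max=5):
    """Run the iterative closed loop; return the (prompt, image, verdict)
    triple with the most fulfilled criteria seen so far."""
    criteria = decompose(user_prompt)               # C = {c1, ..., cn}
    prompt, memory, best = user_prompt, [], None
    for _ in range(n_max):
        image = generate(prompt)                    # I = F_gen(P)
        verdict = [verify(image, c) for c in criteria]  # V = F_ver(I, C)
        memory.append((prompt, image, verdict))     # working memory
        if best is None or sum(verdict) > sum(best[2]):
            best = (prompt, image, verdict)         # track I_best
        if all(verdict):                            # early stop: all met
            break
        prompt = refine(prompt, image, verdict, memory)  # P' = F_ref(...)
    return best

# Toy run: "generation" just echoes prompt words, so refining the prompt
# fixes the missing criterion on the second iteration.
decompose = lambda p: ["red", "cat", "hat"]
generate = lambda p: set(p.split())
verify = lambda img, c: c in img
refine = lambda p, img, v, m: p + " hat" if "hat" not in p else p + " cat"
best = agent_loop("red cat", decompose, generate, verify, refine)
print(sorted(best[1]))  # ['cat', 'hat', 'red']
```

The early-stopping check is what gives the efficiency result reported above: the loop ends as soon as the verifier passes everything, rather than always spending the full budget.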


Overview (reconstructed from the paper's workflow figure):

- Workflow: the user prompt enters the Planner (F_plan), the strategic entry point, which consults the Skill Manager to load domain-specific Agent Skills on demand (creative drawing, aesthetic drawing, text rendering, spatial intelligence) and emits an enhanced prompt. The Decomposer (F_dec) partitions the instruction into atomic criteria C = {c1, c2, ..., cn}, each a binary (yes/no) probe. The model-agnostic Generator (F_gen) synthesizes an image, and the MLLM-based Verifier (F_ver) assesses it, yielding V = {v1, v2, ..., vn}. If all criteria are met, the image is the final high-fidelity output; otherwise, while i < N_max, the Refiner (F_ref) evolves the prompt using Agent Memory and the loop repeats. When the budget is exhausted, the best image I_best = argmax over fulfilled requirements is returned.
- Agent Memory: working memory stores (P, I, V) per iteration, and the Compressor (F_comp) distills reasoning into experiences E. The memory structure is M = {(P1, I1, V1, E1), (P2, I2, V2, E2), ..., (Pi, Ii, Vi, Ei)}, where P = prompt, I = image, V = verification, E = experience.
- Core equations: P1 = F_plan(U, S); C = F_dec(U); I = F_gen(P); V = F_ver(I, C); P' = F_ref(P, I, V, M); E = F_comp(T, M).
- Three core pillars: Agent Loop (iterative refinement); Agent Memory (hierarchical compression); Agent Skill (on-demand loading).
- Key results: +14.22 avg. on mainstream tasks; +14.03 avg. on downstream tasks; a 6B model surpasses Nano Banana 2; early-stopping efficiency (avg. 2.80 iterations vs. 3.26).
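The Compressor (F_comp) in the memory structure can be illustrated with a deliberately simple distillation rule: count which atomic criteria kept failing across a trajectory. This rule is my stand-in for illustration only; the paper's compressor distills full reasoning traces, not just failure counts.

```python
from collections import Counter

# Sketch of the Compressor F_comp from the GEMS memory diagram: working
# memory holds full (P, I, V) tuples per iteration, and the Compressor
# distills a trajectory into a compact experience E for later reuse.

def compress(trajectory):
    """Distill (prompt, image, verdict) tuples into an experience dict
    mapping criterion index -> number of iterations it failed."""
    failures = Counter()
    for _prompt, _image, verdict in trajectory:
        for i, ok in enumerate(verdict):
            if not ok:
                failures[i] += 1
    return dict(failures)

trajectory = [
    ("red cat", "img1", [True, True, False]),
    ("red cat, hat", "img2", [True, True, False]),
    ("red cat wearing a hat", "img3", [True, True, True]),
]
print(compress(trajectory))  # {2: 2} -- criterion 2 failed twice before passing
```

The point of the hierarchical compression is that the next session's Refiner can read this small experience record instead of replaying the full trajectory of prompts and images.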
Q1. What are the three core pillars of the GEMS framework?
A) Agent Loop, Agent Memory, Agent Skill
B) Planner, Generator, Verifier
C) Decomposer, Refiner, Compressor

Q2. Which lightweight model did GEMS enable to surpass the state-of-the-art Nano Banana 2 on GenEval2?
A) Qwen-Image-2512
B) Z-Image-Turbo (6B)
C) Bagel

Q3. What does the Compressor in Agent Memory primarily do?
A) Generates candidate images for the next iteration
B) Distills verbose reasoning traces into concise experiences
C) Verifies whether generated images meet criteria

Today's Reading Tips

Start with the LongCat-Next (DiNA) paper: it introduces a unified multimodal token paradigm that underpins the agent-native generation in GEMS and could also support training simulation-based perception models in CARLA-Air. After grasping the token representation, move to GEMS to see how an agent loop and hierarchical memory extend that paradigm toward high-quality text-to-image synthesis. Finally, explore CARLA-Air if you need a physically coherent air-ground simulation platform for embodied-AI experiments.