1. 📘 Topic and Domain: The paper addresses native multimodal modeling, proposing a unified framework that represents text, vision, and audio within a shared discrete token space for autoregressive generation.
2. 💡 Previous Research and New Ideas: Building on the Next-Token Prediction (NTP) paradigm, and in contrast to prior multimodal systems that treat non-linguistic modalities as external attachments, the paper introduces the Discrete Native Autoregression (DiNA) paradigm and dNaViT (Discrete Native Resolution Vision Transformer) for unified multimodal understanding and generation.
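The core idea of native autoregression over a shared discrete space can be sketched as follows. This is a toy illustration, not the paper's implementation: the vocabulary offsets, modality names, and helper function are hypothetical, chosen only to show how tokens from different modalities can live in one ID space and be trained with a single next-token objective.

```python
# Toy sketch of a unified discrete token space (hypothetical ranges):
# each modality's tokens occupy a disjoint ID range in one shared
# vocabulary, so a single autoregressive model predicts the next token
# regardless of which modality it belongs to.
TEXT_BASE, VISION_BASE, AUDIO_BASE = 0, 10_000, 20_000

def to_unified(modality, local_id):
    """Map a modality-local token ID into the shared vocabulary."""
    base = {"text": TEXT_BASE, "vision": VISION_BASE, "audio": AUDIO_BASE}[modality]
    return base + local_id

# A mixed sequence, e.g. a short caption followed by image tokens:
seq = [to_unified("text", t) for t in [5, 17, 3]] + \
      [to_unified("vision", v) for v in [42, 7]]

# Next-token prediction: training maximizes P(seq[i] | seq[:i]) at every
# position i, whatever modality seq[i] belongs to.
pairs = [(seq[:i], seq[i]) for i in range(1, len(seq))]
```

Because every modality shares one vocabulary and one loss, understanding (predicting text tokens conditioned on vision/audio tokens) and generation (predicting vision/audio tokens conditioned on text) become the same training objective.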
3. ❓ Problem: The paper tackles the challenge of representing non-linguistic modalities (vision and audio) in a discrete token space, aiming to break the performance ceiling of discrete visual modeling and to reconcile the traditionally conflicting objectives of understanding and generation.
4. 🛠️ Methods: The paper employs Semantic-and-Aligned Encoders (SAE) for semantic completeness, Residual Vector Quantization (RVQ) for hierarchical discrete tokenization, a modality-agnostic MoE backbone, and internal linguistic guidance for audio generation, realized via parallel and serial decoding strategies.
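Residual Vector Quantization, the tokenization scheme named above, is a standard technique that can be sketched briefly: each stage quantizes the residual error left by the previous stage, so a vector is encoded as a short list of codebook indices of increasing refinement. This is a minimal generic sketch with made-up codebook sizes, not the paper's tokenizer.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: stage k picks the code vector in
    codebooks[k] nearest to the residual left by stages 0..k-1."""
    residual = np.asarray(x, dtype=float)
    indices = []
    for cb in codebooks:  # cb: (K, D) array of code vectors
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code vectors across stages."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))
```

Each input vector thus becomes a hierarchy of discrete tokens (one index per stage), which is what lets a continuous visual or audio embedding enter the shared discrete token space described above.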
5. 📊 Results and Evaluation: LongCat-Next achieves competitive visual-understanding results (MMMU 70.6, MathVista 83.1), strong visual generation (GenEval 84.44), and state-of-the-art audio capabilities, while retaining robust text performance; it outperforms other unified multimodal models while reconciling the understanding and generation objectives.