2025-09-16 Papers


Paper 1

OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

Published: 2025-09-15

Link: http://arxiv.org/pdf/2509.12201

1. 📘 Topic and Domain: A large-scale multi-domain and multi-modal dataset called OmniWorld for 4D world modeling, focusing on computer vision and machine learning.
2. 💡 Previous Research and New Ideas: Builds on existing datasets such as Sintel, KITTI, and RealEstate10K, which lack diversity and dynamic complexity; proposes a new comprehensive dataset combining synthetic game data with real-world footage across multiple domains.
3. ❓ Problem: Addresses the lack of high-quality, diverse data for training and evaluating 4D world modeling systems, particularly for tasks requiring complex spatial-temporal understanding.
4. 🛠️ Methods: Created OmniWorld by combining self-collected game footage (OmniWorld-Game) with curated public datasets, annotating them with depth maps, camera poses, text captions, optical flow, and foreground masks using specialized pipelines.
5. 📊 Results and Evaluation: Fine-tuning existing models on OmniWorld significantly improved their performance across tasks like depth estimation and camera-controlled video generation, with quantitative improvements shown on multiple benchmarks.


Figure: OmniWorld dataset creation and benchmarking pipeline. Data collection spans four domains: simulator (OmniWorld-Game, captured via ReShade + OBS), robot (AgiBot, DROID, RH20T), human (Epic-Kitchens, HOI4D, Ego-Exo4D, etc.), and internet (CityWalk). Long videos are sliced into manageable clips with quality control that removes motion blur, insufficient features, and excessive motion. A multi-modal annotation pipeline then produces depth maps (ReShade for game data; Prior Depth Anything and FoundationStereo elsewhere), camera poses (VGGT / DroidCalib, CoTracker + bundle adjustment), text captions (Qwen2-VL-72B with domain-specific prompting), optical flow (DPFlow, high-resolution compatible), and foreground masks (RoboEngine, SAM 2, Grounding DINO). Two benchmarks sit on top: a 3D geometric prediction benchmark for monocular and video depth estimation (models: DUSt3R, MASt3R, MonST3R, Fast3R, CUT3R, FLARE, VGGT, MoGe; metrics: Abs Rel, δ<1.25) and a camera-controlled video generation benchmark for text-to-video and image-to-video (models: AC3D, CamCtrl, MotionCtrl, CAMI2V; metrics: TransError, RotError, CamMC, FVD). Fine-tuning SOTA models on OmniWorld (300M+ frames, 600K+ videos) yields significant performance improvements, validating it as an effective training resource for 4D world modeling.
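The two depth metrics named in the benchmark, Abs Rel and δ<1.25, are standard in the depth-estimation literature and can be sketched in a few lines. This is a minimal illustrative implementation using their conventional definitions (mean absolute relative error, and the fraction of pixels whose symmetric prediction/ground-truth ratio is below 1.25); it is not code from the paper, and the scale-alignment step most benchmarks apply beforehand is omitted.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Compute Abs Rel and the delta<1.25 inlier ratio for a depth map."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > eps                       # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)   # mean absolute relative error
    ratio = np.maximum(pred / gt, gt / pred)    # symmetric ratio, always >= 1
    delta = np.mean(ratio < 1.25)               # fraction of "inlier" pixels
    return abs_rel, delta
```

A perfect prediction gives Abs Rel = 0 and δ<1.25 = 1.0; lower Abs Rel and higher δ are better.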
Q1
1. What is the main advantage of OmniWorld-Game compared to existing synthetic datasets?
It has higher frame resolution
It provides more modality types and larger data scale
It focuses only on indoor scenes
Q2
2. When fine-tuning models on OmniWorld, which component showed the most significant improvement?
Text generation capabilities
Audio processing
Camera pose estimation and depth prediction
Q3
3. What unique feature of OmniWorld's text annotations sets it apart from other datasets?
Its captions typically run 150-250 tokens per description
It only uses single-word labels
It focuses exclusively on technical terminology

Paper 2

UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

Published: 2025-09-14

Link: http://arxiv.org/pdf/2509.11543

1. 📘 Topic and Domain: The paper focuses on advancing GUI (Graphical User Interface) automation using semi-online reinforcement learning, specifically in the domain of human-computer interaction and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on previous offline and online reinforcement learning approaches for GUI automation, the paper proposes a novel "Semi-online RL" paradigm that combines the benefits of both by simulating online RL on offline trajectories.
3. ❓ Problem: The paper addresses the dilemma between offline RL (which enables stable training but struggles with multi-step tasks) and online RL (which captures trajectory-level signals but suffers from sparse rewards and high deployment costs).
4. 🛠️ Methods: The authors developed Semi-online RL with three key components: a semi-online rollout that simulates online interaction dynamics, a Patch Module that recovers from action mismatches, and a dual-level advantage computation system for policy optimization.
5. 📊 Results and Evaluation: Their UI-S1-7B model achieved state-of-the-art performance among 7B models across four dynamic benchmarks, with significant improvements over the base model (+12.0% on AndroidWorld, +23.8% on AITW), while maintaining competitive performance on single-turn tasks.
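The semi-online rollout with the Patch Module, as described in the Methods above, can be sketched as follows. This is an assumption-level sketch, not the authors' code: the `policy(state, history) -> (action, thought)` interface is hypothetical, and the exact handling of the "thought" in the off-policy and on-policy patch variants is simplified.

```python
def semi_online_rollout(policy, expert_traj, patch="thought-free"):
    """Replay an expert trajectory, letting the policy act on each recorded
    state; on an action mismatch, the Patch Module substitutes the expert
    action so the rollout continues instead of terminating."""
    history, transitions = [], []
    for state, expert_action in expert_traj:
        action, thought = policy(state, history)
        matched = (action == expert_action)
        if not matched:
            # Patch Module: replace the incorrect action with the expert
            # action; the thought-free variant pairs it with no thought.
            action = expert_action
            thought = None if patch == "thought-free" else thought
        history.append((state, action, thought))    # policy-generated history H_t
        transitions.append((state, action, matched))
    return transitions
```

The key point is that a mismatch is recorded (for reward computation) but does not abort the trajectory, which is what lets offline expert data simulate online multi-turn dynamics.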


Figure: UI-S1 semi-online reinforcement learning workflow. Starting from static offline expert trajectories τ* = {(S₁*, a₁*), ..., (Sₜ*, aₜ*)}, the semi-online rollout generates N trajectories while maintaining a policy-generated history H_t = {(S₁, a₁, T₁), ..., (Sₜ₋₁, aₜ₋₁, Tₜ₋₁)} to simulate online interaction dynamics. If the policy's action matches the expert's, the rollout continues; otherwise the Patch Module substitutes the expert action in one of three variants: thought-free (a*, ∅), off-policy (a*, M₀(...)), or on-policy (a*, M(...)). Rewards are computed as r_t = 0.1·r_format + 0.4·r_type + 0.5·r_acc, with discounted future returns R_t = Σ_{k≥t} γ^(k−t) r_k. A step-level advantage A_S(a_t) = (R_t − μ_t) / σ_t supplies local optimization signals, an episode-level advantage A_E(τ) = (R(τ) − μ_τ) / σ_τ captures global task completion, and the two combine as A(a_t) = A_E(τ) + ω·A_S(a_t) for group-in-group policy optimization via a PPO-style clipped surrogate objective J(θ) = E[min(ρ(θ)A(a_t), clip(ρ(θ))A(a_t))]. Evaluation introduces SOP (Semi-Online Performance), which correlates strongly with dynamic benchmarks (AndroidWorld, AITW, MiniWob++; R² = 0.934). Key innovation: bridging offline efficiency with online multi-turn reasoning. UI-S1-7B is SOTA among 7B models, with +12.0% on AndroidWorld and +23.8% on AITW-Gen.
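The reward and dual-level advantage formulas in the figure can be turned into a small numerical sketch. This follows the stated equations but makes assumptions the summary does not pin down: the values of γ and ω are placeholders, and the episode return R(τ) is taken to be the discounted return from the first step.

```python
import numpy as np

def step_reward(r_format, r_type, r_acc):
    # Per-step reward mix from the figure: format, action-type, accuracy terms.
    return 0.1 * r_format + 0.4 * r_type + 0.5 * r_acc

def discounted_returns(rewards, gamma=0.9):
    # R_t = sum_{k>=t} gamma^(k-t) * r_k, computed backwards over the rollout.
    R = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def combined_advantages(group_returns, omega=0.5, eps=1e-8):
    """group_returns: (N, T) array of returns for N rollouts of one task.
    Step-level advantage normalizes across the group at each step t;
    episode-level advantage normalizes total returns across the group."""
    G = np.asarray(group_returns, dtype=np.float64)
    A_step = (G - G.mean(axis=0)) / (G.std(axis=0) + eps)    # A_S(a_t)
    ep = G[:, 0]                 # R(tau) taken as the return from step 1
    A_ep = (ep - ep.mean()) / (ep.std() + eps)               # A_E(tau)
    return A_ep[:, None] + omega * A_step    # A(a_t) = A_E + omega * A_S
```

The group-in-group idea is that every action's advantage carries both a global signal (did this rollout beat its siblings overall?) and a local one (was this step better than the group's at the same position?).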
Q1
1. What is the main innovation of the Semi-online RL approach compared to traditional methods?
It uses larger language models for training
It simulates online RL dynamics using offline trajectories
It completely eliminates the need for training data
Q2
2. When using the Patch Module in the paper's method, what happens if an action mismatch occurs?
The training immediately terminates
The model restarts from the beginning
The module replaces the incorrect action with the expert action and continues training
Q3
3. What was the most significant performance improvement achieved by UI-S1-7B over its base model?
+23.8% on AITW
+12.0% on AndroidWorld
+7.1% on GUI Odyssey

Paper 3

LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Published: 2025-09-15

Link: http://arxiv.org/pdf/2509.12203

1. 📘 Topic and Domain: The paper presents LazyDrag, a training-free method for drag-based image editing using Multi-Modal Diffusion Transformers (MM-DiTs).
2. 💡 Previous Research and New Ideas: Based on previous drag-based editing methods that relied on implicit point matching via attention, this paper proposes using explicit correspondence maps instead of implicit matching.
3. ❓ Problem: The paper aims to solve the instability and limitations of current drag-based editing methods that require test-time optimization or weakened inversion strength, which compromises editing quality and capabilities.
4. 🛠️ Methods: The authors use a two-stage approach: first generating an explicit correspondence map from drag instructions, then using this map to drive attention controls for identity and background preservation in MM-DiTs.
5. 📊 Results and Evaluation: The method outperformed existing baselines on DragBench in terms of drag accuracy and perceptual quality, as validated by VIEScore metrics and human evaluation, achieving state-of-the-art performance without requiring test-time optimization.


Figure: LazyDrag explicit-correspondence-driven drag editing pipeline. Given an input image and drag instructions D = {(s_i, e_i)}, full-strength DDIM inversion extracts z_T. Explicit correspondence map generation uses winner-takes-all fusion of the displacement field V to produce a matching map M(x) and confidence A(x); the latent is initialized as ẑ_T with Gaussian noise in inpainting regions. The image is partitioned into background (R_bg), destination (R_dst), inpainting (R_inp), and transition (R_trans) regions. MM-DiT denoising then runs with attention control: on the input side, token replacement and token concatenation manipulate Q, K, V via the correspondence map; on the output side, attention refinement applies gated merging with weight γ = h_t · A(x), i.e. y ← (1−γ)y + γ·y_cached. The output is an edited image with identity preservation and text guidance. Key innovation: deterministic drag-based correspondence replaces fragile attention-similarity matching, enabling full-strength inversion without test-time optimization (TTO); this is the first drag-editing method for multi-modal diffusion transformers with text guidance. Achieved benefits: no TTO, full-strength inversion, high-fidelity inpainting, text-guided semantic editing, multi-round editing workflows. Technical components: winner-takes-all displacement fusion, Gaussian noise for inpainting regions, correspondence-driven token control, gated attention output refinement, single-stream attention manipulation.
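The gated merging step in the output-control branch is simple enough to sketch directly from the figure's formula. This is an illustrative sketch, not the authors' implementation: the tensor shapes and the assumption that h_t is a scalar per-timestep weight in [0, 1] are mine.

```python
import numpy as np

def gated_merge(y, y_cached, matching_conf, h_t):
    """Blend the fresh attention output y with the cached output y_cached
    (from inversion) using gamma = h_t * A(x), per spatial location.

    y, y_cached:   (H*W, C) attention outputs
    matching_conf: (H*W,) correspondence confidence A(x) in [0, 1]
    h_t:           scalar timestep-dependent weight in [0, 1]
    """
    gamma = (h_t * matching_conf)[:, None]      # broadcast over channels
    return (1.0 - gamma) * y + gamma * y_cached  # y <- (1-gamma)y + gamma*y_cached
```

Where the correspondence is confident (A(x) near 1), the cached output dominates, preserving identity and background; where it is not, the fresh denoising output passes through, which is what lets inpainting and text-guided edits happen in low-confidence regions.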
Q1
1. What is the key innovation of LazyDrag compared to previous drag-based editing methods?
Using an explicit correspondence map instead of implicit point matching
Introducing a new type of neural network architecture
Adding more layers to the diffusion model
Q2
2. What was a major limitation that LazyDrag aimed to overcome?
Slow processing speed of image editing
The need for test-time optimization or weakened inversion strength
High memory requirements for operation
Q3
3. How does LazyDrag handle multiple opposing drag instructions?
By averaging all drag directions
By canceling out opposing forces
By using a winner-takes-all approach with Voronoi partitioning