2025-09-16 Papers


Paper 1

OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

Published: 2025-09-15

Link: http://arxiv.org/pdf/2509.12201

1. 📘 Topic and Domain: A large-scale multi-domain and multi-modal dataset called OmniWorld for 4D world modeling, focusing on computer vision and machine learning.
2. 💡 Previous Research and New Ideas: Builds on existing datasets such as Sintel, KITTI, and RealEstate10K, which lack diversity and dynamic complexity; proposes a new comprehensive dataset combining synthetic game data with real-world footage across multiple domains.
3. ❓ Problem: Addresses the lack of high-quality, diverse data for training and evaluating 4D world modeling systems, particularly for tasks requiring complex spatial-temporal understanding.
4. 🛠️ Methods: Created OmniWorld by combining self-collected game footage (OmniWorld-Game) with curated public datasets, annotating them with depth maps, camera poses, text captions, optical flow, and foreground masks using specialized pipelines.
5. 📊 Results and Evaluation: Fine-tuning existing models on OmniWorld significantly improved their performance across tasks like depth estimation and camera-controlled video generation, with quantitative improvements shown on multiple benchmarks.


Figure: OmniWorld dataset creation and benchmarking pipeline. Data collection spans four domains: simulator (OmniWorld-Game, captured via ReShade + OBS), robot (AgiBot, DROID, RH20T), human (Epic-Kitchens, HOI4D, Ego-Exo4D, etc.), and internet (CityWalk). Long videos are sliced into manageable clips with quality control that removes motion blur, insufficient features, and excessive motion. A multi-modal annotation pipeline then produces depth maps (ReShade for game data; Prior Depth Anything and FoundationStereo elsewhere), camera poses (VGGT / DroidCalib, CoTracker + bundle adjustment), text captions (Qwen2-VL-72B with domain-specific prompting), optical flow (DPFlow, high-resolution compatible), and foreground masks (RoboEngine, SAM 2, Grounding DINO). Two benchmarks sit on top: a 3D geometric prediction benchmark for monocular and video depth estimation (models: DUSt3R, MASt3R, MonST3R, Fast3R, CUT3R, FLARE, VGGT, MoGe; metrics: Abs Rel, δ<1.25) and a camera-controlled video generation benchmark for text-to-video and image-to-video (models: AC3D, CamCtrl, MotionCtrl, CAMI2V; metrics: TransError, RotError, CamMC, FVD). Fine-tuning SOTA models on OmniWorld (300M+ frames, 600K+ videos) yields significant performance improvements, validating it as an effective training resource for 4D world modeling.
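The two depth metrics named in the benchmark, Abs Rel and δ<1.25, are standard in the depth-estimation literature and can be sketched in a few lines. This is a minimal illustrative implementation using their conventional definitions (mean absolute relative error, and the fraction of pixels whose symmetric prediction/ground-truth ratio is below 1.25); it is not code from the paper, and the scale-alignment step most benchmarks apply beforehand is omitted.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Compute Abs Rel and the delta<1.25 inlier ratio for a depth map."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > eps                       # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)   # mean absolute relative error
    ratio = np.maximum(pred / gt, gt / pred)    # symmetric ratio, always >= 1
    delta = np.mean(ratio < 1.25)               # fraction of "inlier" pixels
    return abs_rel, delta
```

A perfect prediction gives Abs Rel = 0 and δ<1.25 = 1.0; lower Abs Rel and higher δ are better.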
Q1
1. What is the main advantage of OmniWorld-Game compared to existing synthetic datasets?
It has higher frame resolution
It provides more modality types and larger data scale
It focuses only on indoor scenes
Q2
2. When fine-tuning models on OmniWorld, which component showed the most significant improvement?
Text generation capabilities
Audio processing
Camera pose estimation and depth prediction
Q3
3. What unique feature of OmniWorld's text annotations sets it apart from other datasets?
Its captions typically run 150-250 tokens per description
It only uses single-word labels
It focuses exclusively on technical terminology

Paper 2

UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

Published: 2025-09-14

Link: http://arxiv.org/pdf/2509.11543

1. 📘 Topic and Domain: The paper focuses on advancing GUI (Graphical User Interface) automation using semi-online reinforcement learning, specifically in the domain of human-computer interaction and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on previous offline and online reinforcement learning approaches for GUI automation, the paper proposes a novel "Semi-online RL" paradigm that combines the benefits of both by simulating online RL on offline trajectories.
3. ❓ Problem: The paper addresses the dilemma between offline RL (which enables stable training but struggles with multi-step tasks) and online RL (which captures trajectory-level signals but suffers from sparse rewards and high deployment costs).
4. 🛠️ Methods: The authors developed Semi-online RL with three key components: a semi-online rollout that simulates online interaction dynamics, a Patch Module that recovers from action mismatches, and a dual-level advantage computation system for policy optimization.
5. 📊 Results and Evaluation: Their UI-S1-7B model achieved state-of-the-art performance among 7B models across four dynamic benchmarks, with significant improvements over the base model (+12.0% on AndroidWorld, +23.8% on AITW), while maintaining competitive performance on single-turn tasks.
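The semi-online rollout with the Patch Module, as described in the Methods above, can be sketched as follows. This is an assumption-level sketch, not the authors' code: the `policy(state, history) -> (action, thought)` interface is hypothetical, and the exact handling of the "thought" in the off-policy and on-policy patch variants is simplified.

```python
def semi_online_rollout(policy, expert_traj, patch="thought-free"):
    """Replay an expert trajectory, letting the policy act on each recorded
    state; on an action mismatch, the Patch Module substitutes the expert
    action so the rollout continues instead of terminating."""
    history, transitions = [], []
    for state, expert_action in expert_traj:
        action, thought = policy(state, history)
        matched = (action == expert_action)
        if not matched:
            # Patch Module: replace the incorrect action with the expert
            # action; the thought-free variant pairs it with no thought.
            action = expert_action
            thought = None if patch == "thought-free" else thought
        history.append((state, action, thought))    # policy-generated history H_t
        transitions.append((state, action, matched))
    return transitions
```

The key point is that a mismatch is recorded (for reward computation) but does not abort the trajectory, which is what lets offline expert data simulate online multi-turn dynamics.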


Figure: UI-S1 semi-online reinforcement learning workflow. Starting from static offline expert trajectories τ* = {(S₁*, a₁*), ..., (Sₜ*, aₜ*)}, the semi-online rollout generates N trajectories while maintaining a policy-generated history H_t = {(S₁, a₁, T₁), ..., (Sₜ₋₁, aₜ₋₁, Tₜ₋₁)} to simulate online interaction dynamics. If the policy's action matches the expert's, the rollout continues; otherwise the Patch Module substitutes the expert action in one of three variants: thought-free (a*, ∅), off-policy (a*, M₀(...)), or on-policy (a*, M(...)). Rewards are computed as r_t = 0.1·r_format + 0.4·r_type + 0.5·r_acc, with discounted future returns R_t = Σ_{k≥t} γ^(k−t) r_k. A step-level advantage A_S(a_t) = (R_t − μ_t) / σ_t supplies local optimization signals, an episode-level advantage A_E(τ) = (R(τ) − μ_τ) / σ_τ captures global task completion, and the two combine as A(a_t) = A_E(τ) + ω·A_S(a_t) for group-in-group policy optimization via a PPO-style clipped surrogate objective J(θ) = E[min(ρ(θ)A(a_t), clip(ρ(θ))A(a_t))]. Evaluation introduces SOP (Semi-Online Performance), which correlates strongly with dynamic benchmarks (AndroidWorld, AITW, MiniWob++; R² = 0.934). Key innovation: bridging offline efficiency with online multi-turn reasoning. UI-S1-7B is SOTA among 7B models, with +12.0% on AndroidWorld and +23.8% on AITW-Gen.
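The reward and dual-level advantage formulas in the figure can be turned into a small numerical sketch. This follows the stated equations but makes assumptions the summary does not pin down: the values of γ and ω are placeholders, and the episode return R(τ) is taken to be the discounted return from the first step.

```python
import numpy as np

def step_reward(r_format, r_type, r_acc):
    # Per-step reward mix from the figure: format, action-type, accuracy terms.
    return 0.1 * r_format + 0.4 * r_type + 0.5 * r_acc

def discounted_returns(rewards, gamma=0.9):
    # R_t = sum_{k>=t} gamma^(k-t) * r_k, computed backwards over the rollout.
    R = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def combined_advantages(group_returns, omega=0.5, eps=1e-8):
    """group_returns: (N, T) array of returns for N rollouts of one task.
    Step-level advantage normalizes across the group at each step t;
    episode-level advantage normalizes total returns across the group."""
    G = np.asarray(group_returns, dtype=np.float64)
    A_step = (G - G.mean(axis=0)) / (G.std(axis=0) + eps)    # A_S(a_t)
    ep = G[:, 0]                 # R(tau) taken as the return from step 1
    A_ep = (ep - ep.mean()) / (ep.std() + eps)               # A_E(tau)
    return A_ep[:, None] + omega * A_step    # A(a_t) = A_E + omega * A_S
```

The group-in-group idea is that every action's advantage carries both a global signal (did this rollout beat its siblings overall?) and a local one (was this step better than the group's at the same position?).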
Q1
1. What is the main innovation of the Semi-online RL approach compared to traditional methods?
It uses larger language models for training
It simulates online RL dynamics using offline trajectories
It completely eliminates the need for training data
Q2
2. When using the Patch Module in the paper's method, what happens if an action mismatch occurs?
The training immediately terminates
The model restarts from the beginning
The module replaces the incorrect action with the expert action and continues training
Q3
3. What was the most significant performance improvement achieved by UI-S1-7B over its base model?
+23.8% on AITW
+12.0% on AndroidWorld
+7.1% on GUI Odyssey

Paper 3

LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Published: 2025-09-15

Link: http://arxiv.org/pdf/2509.12203

1. 📘 Topic and Domain: The paper presents LazyDrag, a training-free method for drag-based image editing using Multi-Modal Diffusion Transformers (MM-DiTs).
2. 💡 Previous Research and New Ideas: Based on previous drag-based editing methods that relied on implicit point matching via attention, this paper proposes using explicit correspondence maps instead of implicit matching.
3. ❓ Problem: The paper aims to solve the instability and limitations of current drag-based editing methods that require test-time optimization or weakened inversion strength, which compromises editing quality and capabilities.
4. 🛠️ Methods: The authors use a two-stage approach: first generating an explicit correspondence map from drag instructions, then using this map to drive attention controls for identity and background preservation in MM-DiTs.
5. 📊 Results and Evaluation: The method outperformed existing baselines on DragBench in terms of drag accuracy and perceptual quality, as validated by VIEScore metrics and human evaluation, achieving state-of-the-art performance without requiring test-time optimization.


Figure: LazyDrag explicit-correspondence-driven drag editing pipeline. Given an input image and drag instructions D = {(s_i, e_i)}, full-strength DDIM inversion extracts z_T. Explicit correspondence map generation uses winner-takes-all fusion of the displacement field V to produce a matching map M(x) and confidence A(x); the latent is initialized as ẑ_T with Gaussian noise in inpainting regions. The image is partitioned into background (R_bg), destination (R_dst), inpainting (R_inp), and transition (R_trans) regions. MM-DiT denoising then runs with attention control: on the input side, token replacement and token concatenation manipulate Q, K, V via the correspondence map; on the output side, attention refinement applies gated merging with weight γ = h_t · A(x), i.e. y ← (1−γ)y + γ·y_cached. The output is an edited image with identity preservation and text guidance. Key innovation: deterministic drag-based correspondence replaces fragile attention-similarity matching, enabling full-strength inversion without test-time optimization (TTO); this is the first drag-editing method for multi-modal diffusion transformers with text guidance. Achieved benefits: no TTO, full-strength inversion, high-fidelity inpainting, text-guided semantic editing, multi-round editing workflows. Technical components: winner-takes-all displacement fusion, Gaussian noise for inpainting regions, correspondence-driven token control, gated attention output refinement, single-stream attention manipulation.
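The gated merging step in the output-control branch is simple enough to sketch directly from the figure's formula. This is an illustrative sketch, not the authors' implementation: the tensor shapes and the assumption that h_t is a scalar per-timestep weight in [0, 1] are mine.

```python
import numpy as np

def gated_merge(y, y_cached, matching_conf, h_t):
    """Blend the fresh attention output y with the cached output y_cached
    (from inversion) using gamma = h_t * A(x), per spatial location.

    y, y_cached:   (H*W, C) attention outputs
    matching_conf: (H*W,) correspondence confidence A(x) in [0, 1]
    h_t:           scalar timestep-dependent weight in [0, 1]
    """
    gamma = (h_t * matching_conf)[:, None]      # broadcast over channels
    return (1.0 - gamma) * y + gamma * y_cached  # y <- (1-gamma)y + gamma*y_cached
```

Where the correspondence is confident (A(x) near 1), the cached output dominates, preserving identity and background; where it is not, the fresh denoising output passes through, which is what lets inpainting and text-guided edits happen in low-confidence regions.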
Q1
1. What is the key innovation of LazyDrag compared to previous drag-based editing methods?
Using an explicit correspondence map instead of implicit point matching
Introducing a new type of neural network architecture
Adding more layers to the diffusion model
Q2
2. What was a major limitation that LazyDrag aimed to overcome?
Slow processing speed of image editing
The need for test-time optimization or weakened inversion strength
High memory requirements for operation
Q3
3. How does LazyDrag handle multiple opposing drag instructions?
By averaging all drag directions
By canceling out opposing forces
By using a winner-takes-all approach with Voronoi partitioning