2025-07-30 Papers


Paper 1

HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

Published: 2025-07-29

Link: http://arxiv.org/pdf/2507.21809

1. 📘 Topic and Domain: The paper focuses on generating immersive, explorable, and interactive 3D worlds from text or images using AI, falling within the domains of computer vision and computer graphics.
2. 💡 Previous Research and New Ideas: The paper builds on previous video-based and 3D-based world generation methods, proposing a novel framework that combines both approaches through a semantically layered 3D mesh representation with panoramic world proxies.
3. ❓ Problem: The paper addresses the limitations of existing world generation approaches, where video-based methods lack 3D consistency and rendering efficiency, while 3D-based methods struggle with limited training data and memory-inefficient representations.
4. 🛠️ Methods: The authors developed HunyuanWorld 1.0, which uses a staged generative framework combining panorama generation, world layering through agentic decomposition, and layer-wise 3D reconstruction with cross-layer depth alignment.
5. 📊 Results and Evaluation: The system achieved state-of-the-art performance in generating coherent 3D worlds, outperforming existing approaches across multiple metrics (BRISQUE, NIQE, Q-Align, CLIP scores), while enabling practical applications in virtual reality, physical simulation, and game development.
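The three stages above can be sketched as a toy pipeline. This is a minimal, runnable sketch based only on the summary: the real system uses a Panorama-DiT diffusion transformer, VLM-based agentic layering, and mesh reconstruction, none of which is reproduced here. All function names and the dict-based layer representation are illustrative placeholders.

```python
def generate_panorama(prompt):
    """Stage 1: produce a panoramic 'world proxy' (here just a dict)."""
    return {"prompt": prompt, "format": "equirectangular"}

def decompose_into_layers(panorama):
    """Stage 2: agentic decomposition into semantic layers
    (foreground objects, background, sky)."""
    return [
        {"name": "foreground", "source": panorama["prompt"]},
        {"name": "background", "source": panorama["prompt"]},
        {"name": "sky", "source": panorama["prompt"]},
    ]

def reconstruct_layers(layers):
    """Stage 3: layer-wise reconstruction with cross-layer depth alignment.
    Each layer gets a monotonically increasing toy depth so that
    foreground < background < sky, mimicking the alignment constraint."""
    for depth, layer in enumerate(layers, start=1):
        layer["depth"] = float(depth)
    return layers

def generate_world(prompt):
    return reconstruct_layers(decompose_into_layers(generate_panorama(prompt)))

world = generate_world("a quiet coastal village at dusk")
for layer in world:
    print(layer["name"], layer["depth"])
```

The point of the staged design is that each stage consumes a complete, inspectable intermediate (panorama, then layers, then depths), which is what makes the individual layers exportable and editable downstream.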

[Infographic: HunyuanWorld 1.0 workflow — text/image input → Panorama-DiT generation (LLM enhancement, ERP projection, circular denoising, elevation-aware augmentation) → agentic world layering (VLM + Grounding DINO + ZIM segmentation) → layer-aligned depth estimation → layer-wise 3D reconstruction (foreground objects via image-to-3D generation, background via sheet warping, sky via HDRI); long-range extension with Voyager video diffusion; applications in VR, game development, and physics simulation with disentangled objects.]
Q1. What is the key innovation in HunyuanWorld 1.0's approach to 3D world generation compared to previous methods?
A. Using only video-based generation techniques
B. Combining video- and 3D-based approaches through a semantically layered mesh representation
C. Focusing exclusively on 3D mesh optimization

Q2. Which stage comes first in HunyuanWorld 1.0's generation pipeline?
A. Layer-wise 3D reconstruction
B. World layering through agentic decomposition
C. Panorama generation as world proxy

Q3. What unique feature of HunyuanWorld 1.0 enables better interaction with generated 3D worlds?
A. High-resolution textures
B. Real-time rendering capabilities
C. Disentangled object representations allowing individual object manipulation

Paper 2

X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

Published: 2025-07-29

Link: http://arxiv.org/pdf/2507.22058

1. 📘 Topic and Domain: The paper focuses on improving discrete autoregressive image generation models using reinforcement learning, operating in the domain of computer vision and machine learning.
2. 💡 Previous Research and New Ideas: The paper builds on previous work in autoregressive image generation such as DALL-E, but proposes using reinforcement learning to improve generation quality rather than switching to diffusion models, as much recent research has done.
3. ❓ Problem: The paper addresses the limitations of discrete autoregressive image models which typically suffer from low visual fidelity, distorted outputs, and poor instruction following due to cumulative errors during generation.
4. 🛠️ Methods: The authors developed X-Omni, which combines a semantic image tokenizer, unified autoregressive model for language and images, and offline diffusion decoder, using Group Relative Policy Optimization (GRPO) reinforcement learning to align the model outputs.
5. 📊 Results and Evaluation: X-Omni achieved state-of-the-art performance in image generation tasks using a 7B language model, demonstrating high aesthetic quality, strong instruction following capabilities, and accurate text rendering in both English and Chinese.
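The GRPO step described above can be illustrated with a toy example. This sketch shows only the reward-shaping side: weighted aggregation of reward components and group-relative advantage normalization. The component names and weights are invented for illustration (the paper's reward combines HPSv2, Qwen2.5-VL alignment, and OCR scores, and uses G=16 rollouts), and the policy-update step with clipped importance sampling is omitted.

```python
import statistics

def aggregate_reward(components, weights):
    """Weighted aggregation of per-image reward components
    (aesthetics, text-image alignment, OCR-based text accuracy, ...)."""
    return sum(weights[name] * score for name, score in components.items())

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward by the
    group mean and standard deviation, A_i = (r_i - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# One group of G=4 rollouts for the same prompt (toy scores in [0, 1]).
weights = {"aesthetics": 0.4, "alignment": 0.4, "ocr": 0.2}
rollouts = [
    {"aesthetics": 0.9, "alignment": 0.8, "ocr": 1.0},
    {"aesthetics": 0.5, "alignment": 0.6, "ocr": 0.0},
    {"aesthetics": 0.7, "alignment": 0.7, "ocr": 1.0},
    {"aesthetics": 0.4, "alignment": 0.5, "ocr": 0.0},
]
rewards = [aggregate_reward(r, weights) for r in rollouts]
advantages = group_relative_advantages(rewards)
print(advantages)  # the best rollout gets the largest positive advantage
```

Because advantages are computed relative to the group, the model is pushed toward whichever rollouts scored best for that prompt without needing an absolute value baseline.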

[Infographic: X-Omni methodology flow — architecture: SigLIP-VQ semantic tokenizer + Qwen2.5-7B-based autoregressive model + diffusion decoder; three-stage pre-training (600B generation tokens, 100B understanding tokens), supervised fine-tuning on BLIP3o-60k plus synthetic and understanding data (~1.5B tokens, sequence length 16,384), then GRPO reinforcement learning (group rollouts G=16, 200 training steps, 180K prompts) with a multi-component reward combining HPSv2 aesthetics, Unified Reward quality, Qwen2.5-VL text-image alignment, and GOT-OCR2.0/PaddleOCR text-rendering accuracy; no dependency on classifier-free guidance; state-of-the-art English and Chinese text rendering.]
Q1. What is the key innovation that X-Omni uses to improve discrete autoregressive image generation?
A. Using larger language models
B. Applying reinforcement learning
C. Switching to diffusion models

Q2. What unique capability does X-Omni demonstrate compared to other unified models?
A. Faster image generation speed
B. Higher resolution outputs
C. Accurate rendering of long texts in both English and Chinese

Q3. What advantage does X-Omni have over other models in terms of classifier-free guidance (CFG)?
A. It requires less computational resources by not relying on CFG
B. It uses an improved version of CFG
C. It combines multiple types of CFG

Paper 3

Geometric-Mean Policy Optimization

Published: 2025-07-28

Link: http://arxiv.org/pdf/2507.20673

1. 📘 Topic and Domain: The paper focuses on improving reinforcement learning for large language models through a new policy optimization approach called Geometric-Mean Policy Optimization (GMPO).
2. 💡 Previous Research and New Ideas: Based on Group Relative Policy Optimization (GRPO), the paper proposes using geometric mean instead of arithmetic mean of token-level rewards, introducing a more stable optimization approach.
3. ❓ Problem: The paper addresses the instability in GRPO caused by outlier importance-weighted rewards during training, which leads to extreme importance sampling ratios and unstable policy updates.
4. 🛠️ Methods: GMPO maximizes the geometric mean of token-level importance-weighted rewards, applying token-level clipping with wider thresholds (e^-0.4, e^0.4), which enables more stable training while maintaining exploration.
5. 📊 Results and Evaluation: GMPO outperformed GRPO by 4.1% on mathematical benchmarks (63.4% vs. 59.3%) with DeepSeek-R1-Distill-Qwen-7B and improved by 1.4% on Geometry3K multimodal reasoning benchmark (54.7% vs. 53.3%) with Qwen2.5-VL-Instruct-7B.
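The core substitution (geometric mean of token-level importance ratios in place of GRPO's arithmetic mean) can be shown numerically. This is a simplified sketch assuming a single sequence and a positive advantage; the full GMPO objective also handles negative advantages and averages over the group of G rollouts, which is omitted here.

```python
import math

def grpo_objective(ratios, advantage):
    """GRPO-style term: arithmetic mean of token-level importance ratios,
    scaled by the advantage. A single outlier ratio can dominate."""
    return advantage * sum(ratios) / len(ratios)

def gmpo_objective(ratios, advantage, clip=0.4):
    """GMPO-style term: geometric mean of token-level importance ratios,
    with token-level clipping to the range (e^-clip, e^clip)."""
    lo, hi = math.exp(-clip), math.exp(clip)
    clipped = [min(max(r, lo), hi) for r in ratios]
    geo_mean = math.prod(clipped) ** (1.0 / len(clipped))
    return advantage * geo_mean

# A sequence whose last token has an outlier importance ratio.
ratios = [1.0, 1.1, 0.9, 8.0]
adv = 1.0
grpo_val = grpo_objective(ratios, adv)
gmpo_val = gmpo_objective(ratios, adv)
print(grpo_val)  # dominated by the outlier ratio
print(gmpo_val)  # outlier clipped to e^0.4 and damped by the geometric mean
```

The geometric mean keeps the objective in a narrower range: the outlier first gets clipped to e^0.4 ≈ 1.49 and then contributes only its fourth root, so the update stays close to 1, which is the stability property the paper exploits.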

[Infographic: GMPO workflow — training on MATH Level 3-5 and Geometry3K with Qwen2.5-Math-7B and DeepSeek-R1-Distill base models; 8 rollouts per question (max 3000 tokens), binary correctness rewards, group-relative advantage normalization as in GRPO; key change: geometric mean of token-level importance ratios replaces GRPO's arithmetic mean, with token-level clipping in (e^-0.4, e^0.4), giving a narrower objective range (|J_GMPO| ≤ |J_GRPO|), lower KL divergence, and higher token entropy; results: +4.1% on math benchmarks (AIME24, AMC, MATH500, Minerva, OlympiadBench) and +1.4% on Geometry3K, consistent across 1.5B and 7B models.]
Q1. What is the main problem that GMPO aims to solve compared to GRPO?
A. Slow training speed in mathematical reasoning tasks
B. Unstable policy updates due to outlier importance-weighted rewards
C. High computational resource requirements

Q2. What unique modification does GMPO introduce to improve upon GRPO?
A. Uses arithmetic mean with larger batch sizes
B. Implements a new reward function
C. Maximizes geometric mean of token-level rewards instead of arithmetic mean

Q3. What is the optimal clipping threshold range used in GMPO that achieved the best performance?
A. (e^-0.4, e^0.4)
B. (0.8, 1.2)
C. (-∞, +∞)