2025-07-30 Papers


Paper 1

HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

Published: 2025-07-29

Link: http://arxiv.org/pdf/2507.21809

1. 📘 Topic and Domain: The paper focuses on generating immersive, explorable, and interactive 3D worlds from text or images using AI, falling within the domains of computer vision and computer graphics.
2. 💡 Previous Research and New Ideas: The paper builds on previous video-based and 3D-based world generation methods, proposing a novel framework that combines both approaches through a semantically layered 3D mesh representation with panoramic world proxies.
3. ❓ Problem: The paper addresses the limitations of existing world generation approaches, where video-based methods lack 3D consistency and rendering efficiency, while 3D-based methods struggle with limited training data and memory-inefficient representations.
4. 🛠️ Methods: The authors developed HunyuanWorld 1.0, which uses a staged generative framework combining panorama generation, world layering through agentic decomposition, and layer-wise 3D reconstruction with cross-layer depth alignment.
5. 📊 Results and Evaluation: The system achieved state-of-the-art performance in generating coherent 3D worlds, outperforming existing approaches across multiple metrics (BRISQUE, NIQE, Q-Align, CLIP scores), while enabling practical applications in virtual reality, physical simulation, and game development.
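The three stages above can be sketched as a toy pipeline. This is a minimal, runnable sketch based only on the summary: the real system uses a Panorama-DiT diffusion transformer, VLM-based agentic layering, and mesh reconstruction, none of which is reproduced here. All function names and the dict-based layer representation are illustrative placeholders.

```python
def generate_panorama(prompt):
    """Stage 1: produce a panoramic 'world proxy' (here just a dict)."""
    return {"prompt": prompt, "format": "equirectangular"}

def decompose_into_layers(panorama):
    """Stage 2: agentic decomposition into semantic layers
    (foreground objects, background, sky)."""
    return [
        {"name": "foreground", "source": panorama["prompt"]},
        {"name": "background", "source": panorama["prompt"]},
        {"name": "sky", "source": panorama["prompt"]},
    ]

def reconstruct_layers(layers):
    """Stage 3: layer-wise reconstruction with cross-layer depth alignment.
    Each layer gets a monotonically increasing toy depth so that
    foreground < background < sky, mimicking the alignment constraint."""
    for depth, layer in enumerate(layers, start=1):
        layer["depth"] = float(depth)
    return layers

def generate_world(prompt):
    return reconstruct_layers(decompose_into_layers(generate_panorama(prompt)))

world = generate_world("a quiet coastal village at dusk")
for layer in world:
    print(layer["name"], layer["depth"])
```

The point of the staged design is that each stage consumes a complete, inspectable intermediate (panorama, then layers, then depths), which is what makes the individual layers exportable and editable downstream.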

[Infographic: HunyuanWorld 1.0 workflow — text/image input → Panorama-DiT generation (LLM enhancement, ERP projection, circular denoising, elevation-aware augmentation) → agentic world layering (VLM + Grounding DINO + ZIM segmentation) → layer-aligned depth estimation → layer-wise 3D reconstruction (foreground objects via image-to-3D generation, background via sheet warping, sky via HDRI); long-range extension with Voyager video diffusion; applications in VR, game development, and physics simulation with disentangled objects.]
Q1. What is the key innovation in HunyuanWorld 1.0's approach to 3D world generation compared to previous methods?
A. Using only video-based generation techniques
B. Combining video- and 3D-based approaches through a semantically layered mesh representation
C. Focusing exclusively on 3D mesh optimization

Q2. Which stage comes first in HunyuanWorld 1.0's generation pipeline?
A. Layer-wise 3D reconstruction
B. World layering through agentic decomposition
C. Panorama generation as world proxy

Q3. What unique feature of HunyuanWorld 1.0 enables better interaction with generated 3D worlds?
A. High-resolution textures
B. Real-time rendering capabilities
C. Disentangled object representations allowing individual object manipulation

Paper 2

X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

Published: 2025-07-29

Link: http://arxiv.org/pdf/2507.22058

1. 📘 Topic and Domain: The paper focuses on improving discrete autoregressive image generation models using reinforcement learning, operating in the domain of computer vision and machine learning.
2. 💡 Previous Research and New Ideas: The paper builds on previous work in autoregressive image generation such as DALL-E, but proposes using reinforcement learning to improve generation quality rather than switching to diffusion models, as much recent research has done.
3. ❓ Problem: The paper addresses the limitations of discrete autoregressive image models which typically suffer from low visual fidelity, distorted outputs, and poor instruction following due to cumulative errors during generation.
4. 🛠️ Methods: The authors developed X-Omni, which combines a semantic image tokenizer, unified autoregressive model for language and images, and offline diffusion decoder, using Group Relative Policy Optimization (GRPO) reinforcement learning to align the model outputs.
5. 📊 Results and Evaluation: X-Omni achieved state-of-the-art performance in image generation tasks using a 7B language model, demonstrating high aesthetic quality, strong instruction following capabilities, and accurate text rendering in both English and Chinese.
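The GRPO step described above can be illustrated with a toy example. This sketch shows only the reward-shaping side: weighted aggregation of reward components and group-relative advantage normalization. The component names and weights are invented for illustration (the paper's reward combines HPSv2, Qwen2.5-VL alignment, and OCR scores, and uses G=16 rollouts), and the policy-update step with clipped importance sampling is omitted.

```python
import statistics

def aggregate_reward(components, weights):
    """Weighted aggregation of per-image reward components
    (aesthetics, text-image alignment, OCR-based text accuracy, ...)."""
    return sum(weights[name] * score for name, score in components.items())

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward by the
    group mean and standard deviation, A_i = (r_i - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# One group of G=4 rollouts for the same prompt (toy scores in [0, 1]).
weights = {"aesthetics": 0.4, "alignment": 0.4, "ocr": 0.2}
rollouts = [
    {"aesthetics": 0.9, "alignment": 0.8, "ocr": 1.0},
    {"aesthetics": 0.5, "alignment": 0.6, "ocr": 0.0},
    {"aesthetics": 0.7, "alignment": 0.7, "ocr": 1.0},
    {"aesthetics": 0.4, "alignment": 0.5, "ocr": 0.0},
]
rewards = [aggregate_reward(r, weights) for r in rollouts]
advantages = group_relative_advantages(rewards)
print(advantages)  # the best rollout gets the largest positive advantage
```

Because advantages are computed relative to the group, the model is pushed toward whichever rollouts scored best for that prompt without needing an absolute value baseline.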

[Infographic: X-Omni methodology flow — architecture: SigLIP-VQ semantic tokenizer + Qwen2.5-7B-based autoregressive model + diffusion decoder; three-stage pre-training (600B generation tokens, 100B understanding tokens), supervised fine-tuning on BLIP3o-60k plus synthetic and understanding data (~1.5B tokens, sequence length 16,384), then GRPO reinforcement learning (group rollouts G=16, 200 training steps, 180K prompts) with a multi-component reward combining HPSv2 aesthetics, Unified Reward quality, Qwen2.5-VL text-image alignment, and GOT-OCR2.0/PaddleOCR text-rendering accuracy; no dependency on classifier-free guidance; state-of-the-art English and Chinese text rendering.]
Q1. What is the key innovation that X-Omni uses to improve discrete autoregressive image generation?
A. Using larger language models
B. Applying reinforcement learning
C. Switching to diffusion models

Q2. What unique capability does X-Omni demonstrate compared to other unified models?
A. Faster image generation speed
B. Higher resolution outputs
C. Accurate rendering of long texts in both English and Chinese

Q3. What advantage does X-Omni have over other models in terms of classifier-free guidance (CFG)?
A. It requires less computational resources by not relying on CFG
B. It uses an improved version of CFG
C. It combines multiple types of CFG

Paper 3

Geometric-Mean Policy Optimization

Published: 2025-07-28

Link: http://arxiv.org/pdf/2507.20673

1. 📘 Topic and Domain: The paper focuses on improving reinforcement learning for large language models through a new policy optimization approach called Geometric-Mean Policy Optimization (GMPO).
2. 💡 Previous Research and New Ideas: Based on Group Relative Policy Optimization (GRPO), the paper proposes using geometric mean instead of arithmetic mean of token-level rewards, introducing a more stable optimization approach.
3. ❓ Problem: The paper addresses the instability in GRPO caused by outlier importance-weighted rewards during training, which leads to extreme importance sampling ratios and unstable policy updates.
4. 🛠️ Methods: GMPO maximizes the geometric mean of token-level importance-weighted rewards, applying token-level clipping with wider thresholds (e^-0.4, e^0.4), which enables more stable training while maintaining exploration.
5. 📊 Results and Evaluation: GMPO outperformed GRPO by 4.1% on mathematical benchmarks (63.4% vs. 59.3%) with DeepSeek-R1-Distill-Qwen-7B and improved by 1.4% on Geometry3K multimodal reasoning benchmark (54.7% vs. 53.3%) with Qwen2.5-VL-Instruct-7B.
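The core substitution (geometric mean of token-level importance ratios in place of GRPO's arithmetic mean) can be shown numerically. This is a simplified sketch assuming a single sequence and a positive advantage; the full GMPO objective also handles negative advantages and averages over the group of G rollouts, which is omitted here.

```python
import math

def grpo_objective(ratios, advantage):
    """GRPO-style term: arithmetic mean of token-level importance ratios,
    scaled by the advantage. A single outlier ratio can dominate."""
    return advantage * sum(ratios) / len(ratios)

def gmpo_objective(ratios, advantage, clip=0.4):
    """GMPO-style term: geometric mean of token-level importance ratios,
    with token-level clipping to the range (e^-clip, e^clip)."""
    lo, hi = math.exp(-clip), math.exp(clip)
    clipped = [min(max(r, lo), hi) for r in ratios]
    geo_mean = math.prod(clipped) ** (1.0 / len(clipped))
    return advantage * geo_mean

# A sequence whose last token has an outlier importance ratio.
ratios = [1.0, 1.1, 0.9, 8.0]
adv = 1.0
grpo_val = grpo_objective(ratios, adv)
gmpo_val = gmpo_objective(ratios, adv)
print(grpo_val)  # dominated by the outlier ratio
print(gmpo_val)  # outlier clipped to e^0.4 and damped by the geometric mean
```

The geometric mean keeps the objective in a narrower range: the outlier first gets clipped to e^0.4 ≈ 1.49 and then contributes only its fourth root, so the update stays close to 1, which is the stability property the paper exploits.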

[Infographic: GMPO workflow — training on MATH Level 3-5 and Geometry3K with Qwen2.5-Math-7B and DeepSeek-R1-Distill base models; 8 rollouts per question (max 3000 tokens), binary correctness rewards, group-relative advantage normalization as in GRPO; key change: geometric mean of token-level importance ratios replaces GRPO's arithmetic mean, with token-level clipping in (e^-0.4, e^0.4), giving a narrower objective range (|J_GMPO| ≤ |J_GRPO|), lower KL divergence, and higher token entropy; results: +4.1% on math benchmarks (AIME24, AMC, MATH500, Minerva, OlympiadBench) and +1.4% on Geometry3K, consistent across 1.5B and 7B models.]
Q1. What is the main problem that GMPO aims to solve compared to GRPO?
A. Slow training speed in mathematical reasoning tasks
B. Unstable policy updates due to outlier importance-weighted rewards
C. High computational resource requirements

Q2. What unique modification does GMPO introduce to improve upon GRPO?
A. Uses arithmetic mean with larger batch sizes
B. Implements a new reward function
C. Maximizes geometric mean of token-level rewards instead of arithmetic mean

Q3. What is the optimal clipping threshold range used in GMPO that achieved the best performance?
A. (e^-0.4, e^0.4)
B. (0.8, 1.2)
C. (-∞, +∞)