2025-07-25 Papers

Paper 1

Group Sequence Policy Optimization

Published: 2025-07-23

Link: http://arxiv.org/pdf/2507.18071

1. 📘 Topic and Domain: The paper introduces a new reinforcement learning algorithm called Group Sequence Policy Optimization (GSPO) for training large language models.
2. 💡 Previous Research and New Ideas: Building on the earlier GRPO (Group Relative Policy Optimization) algorithm, the paper proposes a novel sequence-level approach to reinforcement learning optimization, in place of GRPO's token-level optimization.
3. ❓ Problem: The paper aims to solve the instability and inefficiency issues in current RL algorithms like GRPO, which can lead to model collapse when training large language models.
4. 🛠️ Methods: GSPO defines importance ratios based on sequence likelihood rather than token-level weights, and performs sequence-level clipping, rewarding, and optimization.
5. 📊 Results and Evaluation: GSPO achieved superior training stability and efficiency compared to GRPO, stabilized Mixture-of-Experts (MoE) RL training without requiring complex stabilization strategies, and contributed to performance improvements in Qwen3 models.
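The length-normalized sequence-level ratio and clipped objective described above can be sketched in a few lines. This is a minimal illustration under the paper's definitions, not its implementation; all function and variable names are invented for the example.

```python
import math

def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std across the G responses
    sampled for one query. Names are illustrative, not from the paper."""
    m = sum(rewards) / len(rewards)
    sd = (sum((r - m) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - m) / (sd + 1e-8) for r in rewards]

def gspo_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped GSPO objective for one group, negated so that gradient
    descent maximizes it.

    logp_new / logp_old: per-response lists of token log-probabilities
    under the current and old policies.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        n = len(lp_new)  # response length |y_i|
        # length-normalized sequence-level importance ratio s_i(theta)
        s = math.exp((sum(lp_new) - sum(lp_old)) / n)
        # sequence-level clipping: the whole response is kept or clipped
        s_clipped = max(1.0 - eps, min(1.0 + eps, s))
        terms.append(min(s * adv, s_clipped * adv))
    return -sum(terms) / len(terms)
```

Note the contrast with GRPO: there is one ratio per response, so clipping removes or keeps entire responses rather than individual tokens.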

Group Sequence Policy Optimization

GSPO workflow overview (reconstructed from the diagram):
- GRPO issue: token-level importance ratios cause training instability.
- Key innovation: sequence-level importance ratios, with sequence-level clipping, rewarding, and optimization.
- Importance ratio (length-normalized): s_i(θ) = (π_θ(y_i|x) / π_θ_old(y_i|x))^(1/|y_i|)
- Group advantage (rewards normalized across the G responses for a query): Â_i = (r(x, y_i) - mean) / std
- Objective: J_GSPO(θ) = E[(1/G) Σ_i min(s_i(θ) Â_i, clip(s_i(θ), 1-ε, 1+ε) Â_i)]
- Training process: generate G responses per query, then optimize the clipped sequence-level objective.
- Gradient analysis: tokens within a response receive equal weight, versus GRPO's unequal token weighting.
- Stability benefits: prevents model collapse.
- MoE training: eliminates the need for the Routing Replay strategy and handles expert-routing volatility.
- Infrastructure: tolerant of precision differences, improving training-inference compatibility.
- Results: superior training efficiency and better benchmark performance.
- GSPO-token: a token-level variant for multi-turn RL.
- Application: production success in Qwen3 models.
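
Differentiating the unclipped objective above (clipping omitted for clarity) shows where the "equal token weighting" comes from; this is a sketch following the definitions above, not a formula quoted from the paper:

```latex
\nabla_\theta J_{\mathrm{GSPO}}(\theta)
  = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
      s_i(\theta)\,\hat{A}_i \cdot \frac{1}{|y_i|}
      \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta\!\left(y_{i,t}\mid x, y_{i,<t}\right)\right]
```

Every token in response i is scaled by the same factor s_i(θ) Â_i / |y_i|, whereas GRPO weights each token by its own token-level ratio, which is the source of the instability noted above.
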
Q1
1. What is the main innovation of GSPO compared to previous algorithms?
It uses token-level importance ratios
It defines importance ratio based on sequence likelihood
It eliminates the need for rewards completely
Q2
2. What surprising observation was made about clipping fractions in GSPO versus GRPO?
GSPO clips two orders of magnitude more tokens but achieves better efficiency
GSPO and GRPO had identical clipping fractions
GSPO clips fewer tokens than GRPO
Q3
3. How does GSPO benefit MoE (Mixture-of-Experts) model training?
It requires more complex routing strategies
It makes MoE training impossible
It eliminates the need for Routing Replay strategy while maintaining stability

Paper 2

Captain Cinema: Towards Short Movie Generation

Published: 2025-07-24

Link: http://arxiv.org/pdf/2507.18634

1. 📘 Topic and Domain: The paper presents "Captain Cinema," a framework for generating short movies from textual descriptions, operating in the domain of AI-generated video content and narrative storytelling.
2. 💡 Previous Research and New Ideas: Whereas previous text-to-video models could only generate 5-10 second clips, this paper introduces a novel two-stage approach combining top-down keyframe planning with bottom-up video synthesis to produce longer, narratively coherent videos.
3. ❓ Problem: The paper addresses the challenge of generating long-form, narratively coherent videos with consistent characters and scenes, as existing approaches struggle with maintaining coherence beyond short clips.
4. 🛠️ Methods: The method uses a two-stage approach: first generating keyframes using a Multimodal Diffusion Transformer with GoldenMem compression for long-context memory, then synthesizing video between keyframes using interleaved conditioning.
5. 📊 Results and Evaluation: The results show superior performance in generating visually coherent and narratively consistent short movies compared to baselines, evaluated through automated metrics and user studies, with particularly strong results in temporal dynamics and character consistency preservation.
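The two-stage structure in step 4 above can be sketched as a minimal skeleton. The stubs below merely stand in for the paper's MM-DiT keyframe generator and its interleaved-conditioning video model; every function and field name here is illustrative, not from the paper's code.

```python
def plan_keyframes(storyline, n_keyframes):
    """Top-down stage: turn a storyline into an ordered keyframe plan.
    Toy stub for the MM-DiT keyframe generator with GoldenMem memory."""
    scenes = storyline.split(". ")
    # spread the scenes across the requested number of keyframes
    return [{"index": i, "scene": scenes[i % len(scenes)]}
            for i in range(n_keyframes)]

def synthesize_clip(kf_a, kf_b):
    """Bottom-up stage: generate the footage between two keyframes.
    Toy stub for the interleaved-conditioning video model."""
    return f"clip[{kf_a['index']}->{kf_b['index']}]"

def captain_cinema(storyline, n_keyframes=4):
    """Keyframes first, then video between each consecutive pair."""
    keyframes = plan_keyframes(storyline, n_keyframes)
    return [synthesize_clip(a, b)
            for a, b in zip(keyframes, keyframes[1:])]
```

The design point the sketch captures: long-range narrative consistency is handled once, at the cheap keyframe level, so the expensive video model only ever interpolates between anchored endpoints.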

Captain Cinema: Towards Short Movie Generation

Captain Cinema framework overview (reconstructed from the diagram):
- Top-down keyframe planning: a movie storyline is turned into a sequence of keyframes by an MM-DiT with hybrid attention masking, GoldenMem progressive long-context compression, and dynamic stride sampling.
- Bottom-up video synthesis: a video generation model with multi-keyframe interleaved conditioning and long-context learning fills in the footage between keyframes, yielding a multi-scene short movie.
- Data processing pipeline: ~500 hours of movie data, scene detection, frame extraction, and Gemini annotation, producing ~300K keyframes plus video shots.
- Key technical contributions: hybrid attention (local + global processing), GoldenMem visual memory compression, progressive context training, semantic-oriented context retrieval, and multi-scene narrative coherence.
Q1
1. What is the main innovation of Captain Cinema compared to previous text-to-video models?
It uses higher resolution video generation
It combines top-down keyframe planning with bottom-up video synthesis for longer narratives
It generates videos faster than previous models
Q2
2. What is the purpose of the GoldenMem feature in Captain Cinema?
To compress and manage long-context visual memory efficiently
To improve the video rendering quality
To generate better audio for the videos
Q3
3. What potential ethical concern about this technology is mentioned in the paper?
High energy consumption
Privacy violations
Risk of hyper-realistic misinformation

Paper 3

TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

Published: 2025-07-24

Link: http://arxiv.org/pdf/2507.18537

1. 📘 Topic and Domain: Test-time scaling framework for Visual Auto-Regressive (VAR) image generation models.
2. 💡 Previous Research and New Ideas: Building on test-time scaling for diffusion models and LLMs, the paper proposes the first scaling framework specifically designed for VAR models' coarse-to-fine generation process.
3. ❓ Problem: How to improve image generation quality in VAR models without additional training or substantial computational costs.
4. 🛠️ Methods: Implements adaptive descending batch sizes, clustering-based diversity search for early scales, and resampling-based potential selection for late scales.
5. 📊 Results and Evaluation: Achieved 8.7% improvement in GenEval score (0.69→0.75) on the Infinity model, with consistent improvements across multiple evaluation metrics.
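The three mechanisms in step 4 above fit into one multi-scale loop, sketched below with toy stand-ins: `cluster_select` replaces DINOv2 features plus k-means++, and `score` replaces ImageReward. Only the descending batch schedule is taken from the paper; everything else is an illustrative assumption.

```python
# descending batch sizes per scale, as reported in the paper
BATCH_SCHEDULE = [8, 8, 6, 6, 6, 4, 2, 2, 2, 1, 1, 1, 1]

def cluster_select(candidates, k):
    """Early (coarse) scales: keep k structurally diverse candidates.
    Toy stand-in for DINOv2-feature k-means++ clustering."""
    step = max(1, len(candidates) // k)
    return candidates[::step][:k]

def resample_select(candidates, k, score):
    """Late (fine) scales: keep the k highest-potential candidates.
    `score` stands in for an ImageReward-style reward model."""
    return sorted(candidates, key=score, reverse=True)[:k]

def tts_var(init_candidates, expand, score, switch_scale=6):
    """Run coarse-to-fine generation with a descending batch schedule.

    `expand(c, scale)` stands in for one VAR refinement step; diversity
    search is used before `switch_scale`, reward resampling after.
    """
    candidates = init_candidates
    for scale, batch in enumerate(BATCH_SCHEDULE):
        candidates = [expand(c, scale) for c in candidates]
        if scale < switch_scale:
            candidates = cluster_select(candidates, batch)
        else:
            candidates = resample_select(candidates, batch, score)
    return candidates[0]
```

The shape of the schedule reflects the key constraint named in Q1 below: early coarse tokens cannot be revised later, so the budget is spent keeping many diverse candidates early and pruning hard once rewards become predictive.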

TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

TTS-VAR framework overview (reconstructed from the diagram):
- Input: a text prompt drives the VAR model's multi-scale (coarse-to-fine) generation.
- Adaptive batch sampling: descending batch sizes across scales, [8, 8, 6, 6, 6, 4, 2, 2, 2, 1, 1, 1, 1].
- Early (coarse) scales: clustering-based diversity search using DINOv2 features and k-means++ clustering to extract structural information.
- Late (fine) scales: resampling-based potential selection using ImageReward scoring; potential scores include VALUE, MAX, SUM, and DIFF.
- Result: high-quality images, GenEval 0.69 → 0.75 (an 8.7% improvement).
Q1
1. What is the main challenge in applying test-time scaling to VAR models compared to diffusion models?
VAR models require more computational resources
Early-scale tokens in VAR cannot be refined once generated
VAR models have lower quality outputs
Q2
2. Why does TTS-VAR use clustering-based diversity search in early scales instead of reward-based selection?
To reduce computational costs
Because clustering is more accurate than rewards
Because early-scale rewards don't accurately predict final image quality
Q3
3. What unique feature of the batch size schedule does TTS-VAR implement?
Uses fixed batch sizes throughout generation
Increases batch sizes progressively
Uses larger batches in early scales and decreases them in later scales