2025-09-29 Papers


Paper 1

LongLive: Real-time Interactive Long Video Generation

Published: 2025-09-26

Link: http://arxiv.org/pdf/2509.22622

1. 📘 Topic and Domain: Real-time interactive long video generation using frame-level autoregressive models for AI-powered video content creation.
2. 💡 Previous Research and New Ideas: Built upon diffusion models and autoregressive video generation, introducing new techniques like KV-recache, streaming long tuning, and short window attention with frame sink.
3. ❓ Problem: Addressing the challenges of generating high-quality long videos efficiently while enabling real-time interactive control through prompt switching.
4. 🛠️ Methods: Implemented KV-recache to refresh cached states during prompt switches, streaming long tuning for train-long-test-long alignment, and short window attention with frame sink for faster generation.
5. 📊 Results and Evaluation: Achieved 20.7 FPS on a single NVIDIA H100 GPU, supported up to 240-second video generation, and outperformed baselines on VBench benchmarks while requiring only 32 GPU-days for training.
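The KV-recache idea can be sketched in a few lines: at a prompt switch, rather than discarding the cache or keeping states tied to the old prompt, the keys and values for the retained frames are recomputed under the new prompt. The additive fusion of frame features with the prompt embedding, and all shapes and names below, are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def kv_recache(frame_feats, new_prompt_emb, w_k, w_v):
    """At a prompt switch, rebuild the KV cache from the retained frame
    features conditioned on the NEW prompt embedding, so attention over
    past frames reflects the updated prompt instead of stale states."""
    fused = frame_feats + new_prompt_emb  # hypothetical fusion: additive prompt conditioning
    k = fused @ w_k                       # refreshed keys for the cached frames
    v = fused @ w_v                       # refreshed values for the cached frames
    return k, v
```

This preserves the visual history (the frame features) while re-deriving the attention states, which is what lets generation follow the new prompt without a visible discontinuity.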


Overview (reconstructed from the paper's workflow diagram):

- Framework: frame-level autoregressive generation with causal attention and KV caching; 1.3B parameters; 832×480 at 16 FPS; up to 240 seconds, driven in real time by sequential user prompts.
- KV re-cache: refreshes cached states with each new prompt, ensuring smooth prompt transitions, visual consistency, and semantic adherence across switches.
- Streaming long tuning: train-long, test-long alignment via self-supervised training on long sequences; reduces error accumulation; 32 GPU-days of training.
- Efficient long inference: short window attention plus a frame-level attention sink, cutting compute by 28% and memory by 17%.
- Training pipeline: Wan2.1-T2V-1.3B base model, DMD self-forcing short-clip adaptation, then long-sequence training (60 s with prompt switches) via LoRA fine-tuning (rank 256, 27% of parameters).
- Inference pipeline: sequential user prompts, causal frame-by-frame rollout, KV re-cache at switch points, real-time output at 20.7 FPS.
- Quality: VBench 84.87; long video 83.52; interactive 84.38; strong consistency and smooth transitions.
- Efficiency and scalability: 20.7 FPS on a single H100 (41× faster than SkyReels-V2), INT8 support, multiple prompt switches, O(W+T+S) memory complexity.
- Applications: interactive storytelling, creative content, real-time control, educational videos, cinematic production.
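The short-window-attention-with-frame-sink pattern can be sketched as a boolean mask over frame positions: every query frame attends to a handful of global "sink" frames at the start of the video plus a sliding window of recent frames, so the number of attended keys stays O(W + S) instead of growing with the video length. Window size and sink count below are placeholders.

```python
import numpy as np

def sink_window_mask(n_frames, window, n_sink):
    """Causal attention mask combining a frame sink with a short window:
    frame q attends to the first `n_sink` frames (global anchors) plus
    the most recent `window` frames ending at q."""
    mask = np.zeros((n_frames, n_frames), dtype=bool)
    for q in range(n_frames):
        mask[q, :min(n_sink, q + 1)] = True           # global sink frames
        mask[q, max(0, q - window + 1):q + 1] = True  # local sliding window
    return mask
```

For a 10-frame rollout with `window=3` and `n_sink=2`, the last frame attends only to frames 0, 1, 7, 8, 9, which is where the compute and memory savings come from.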
Q1
1. What is the main innovation of KV-recache in LongLive compared to traditional KV caching?
It completely discards all previous cached states
It refreshes cached states by combining previous videos with new prompt embeddings
It stores cached states permanently without any updates
Q2
2. What is the maximum video length that LONGLIVE can generate on a single H100 GPU?
60 seconds
120 seconds
240 seconds
Q3
3. How does streaming long tuning improve the model's performance?
By using larger batch sizes during training
By training only on short video clips
By aligning training and inference conditions through long sequence generation

Paper 2

Quantile Advantage Estimation for Entropy-Safe Reasoning

Published: 2025-09-26

Link: http://arxiv.org/pdf/2509.22611

1. 📘 Topic and Domain: Reinforcement learning for large language models, specifically focusing on entropy control in language model reasoning tasks.
2. 💡 Previous Research and New Ideas: Based on value-free RL methods like GRPO and DAPO, proposes a new Quantile Advantage Estimation (QAE) approach that replaces mean-baseline with group-wise K-quantile baseline.
3. ❓ Problem: Addresses the dual challenge of preventing both entropy collapse (premature convergence) and entropy explosion (uncontrolled exploration) in LLM reinforcement learning.
4. 🛠️ Methods: Implements a K-quantile baseline that creates a two-regime gate: reinforcing rare successes on hard queries and targeting remaining failures on easy queries, with theoretical guarantees for entropy safety.
5. 📊 Results and Evaluation: Achieved sustained pass@1 gains on Qwen3-8B/14B-Base across AIME'24/'25 and AMC'23 benchmarks, with roughly 80% of responses receiving zero advantage, demonstrating more efficient credit assignment.
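The K-quantile baseline can be sketched in a few lines; `np.quantile` stands in for the paper's group-wise quantile, and the two-regime gate falls out of the reward distribution. On a hard query (mostly failures), the quantile baseline sits at 0, so the rare success gets a positive advantage while failures get zero; on an easy query (mostly successes), it sits at 1, so the remaining failures get a negative advantage while successes get zero. This is a sketch, not the paper's exact estimator.

```python
import numpy as np

def qae_advantage(rewards, k=0.5, eps=1e-8):
    """Quantile Advantage Estimation (sketch): replace the group mean
    baseline with the group's K-quantile, then normalize by the reward
    standard deviation (eps added for numerical safety)."""
    r = np.asarray(rewards, dtype=float)
    baseline = np.quantile(r, k)        # group-wise K-quantile baseline
    return (r - baseline) / (r.std() + eps)
```

With binary rewards, most responses in a group match the baseline exactly and receive zero advantage, which is consistent with the reported ~80% sparsity.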


Overview (reconstructed from the paper's method diagram):

- Problem: the mean baseline in RLVR induces both entropy collapse and entropy explosion.
- QAE: replace the mean baseline with a group-wise K-quantile baseline.
- Hard queries (success rate p ≤ 1−K): baseline is 0, so rare successes are reinforced.
- Easy queries (p > 1−K): baseline is 1, so the remaining failures are targeted.
- Advantage estimate: Â_i = (R_i - Q_K({R_j})) / (std({R_j}) + ε).
- Theoretical guarantee: two-sided entropy safety under first-order updates.
- Implementation: a drop-in, one-line change with a single parameter K, compatible with existing algorithms (DAPO, GSPO, CLIP-COV, KL-COV) and model sizes (Qwen3-8B/14B/30B).
- Sparsity effect: about 80% of responses receive zero advantage, focusing updates on informative samples.
- Entropy control: prevents both explosion and collapse, yielding stable training and better sample efficiency.
- Results: consistent pass@1 gains on AIME'24/'25 and AMC'23, sustained without plateau across model sizes.
- Key innovation: baseline design as an entropy control mechanism.
Q1
1. What is the main innovation of QAE compared to previous methods like GRPO and DAPO?
It introduces a new token-level clipping mechanism
It replaces the mean baseline with a K-quantile baseline
It adds a new entropy regularization term
Q2
2. What interesting empirical observation did the authors make about QAE's efficiency?
It reduced training time by 50%
It required twice the computing resources
About 80% of responses received zero advantage
Q3
3. How does QAE handle different difficulty levels of queries?
It treats all queries the same way regardless of difficulty
It reinforces rare successes on hard queries and targets remaining failures on easy ones
It only focuses on easy queries to maximize performance

Paper 3

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Published: 2025-09-26

Link: http://arxiv.org/pdf/2509.22576

1. 📘 Topic and Domain: Training large language model (LLM) agents in multi-turn environments using reinforcement learning, focusing on entropy-regularized policy optimization.
2. 💡 Previous Research and New Ideas: Based on traditional reinforcement learning approaches like PPO and GRPO; proposes a new framework called EPO that introduces entropy smoothing regularization and adaptive phase-based weighting.
3. ❓ Problem: Addresses the "exploration-exploitation cascade failure" in multi-turn environments with sparse rewards, where agents either commit to flawed strategies too early or engage in chaotic exploration that destabilizes training.
4. 🛠️ Methods: Implements three mechanisms: entropy regularization across multi-turn settings, entropy smoothing regularizer to prevent abrupt fluctuations, and adaptive phase-based weighting to balance exploration and exploitation throughout training.
5. 📊 Results and Evaluation: Achieved up to 152% performance improvement on ScienceWorld and 19.8% on ALFWorld benchmarks compared to baselines, with significantly more stable training dynamics and better generalization to unseen tasks.
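The entropy smoothing regularizer might look roughly like this: penalize the current policy entropy only when it drifts outside a band around recent history, which damps both sudden collapse and sudden explosion. The quadratic penalty, the running-mean bound, and the tolerance width are assumptions for illustration; the paper specifies only that deviations from historical entropy bounds are penalized.

```python
import numpy as np

def entropy_smoothing_penalty(entropy, history, tol=0.2):
    """Entropy smoothing (sketch): zero penalty while the current
    entropy stays within `tol` of the running mean of recent entropies,
    quadratic penalty for abrupt deviations outside that band."""
    if not history:
        return 0.0                          # no history yet: no constraint
    mu = float(np.mean(history))            # historical entropy level
    dev = abs(entropy - mu)
    return max(0.0, dev - tol) ** 2         # penalize only outside the band
```

In a training loop this would be evaluated each update against a sliding window of past entropies, so the bound itself adapts as the policy evolves.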


Overview (reconstructed from the paper's workflow diagram):

- Problem: exploration-exploitation cascade failure in multi-turn environments (30+ turns per episode, sparse rewards); baselines are standard RL methods (PPO, GRPO).
- Component 1, multi-turn entropy regularization: L_H(θ) is the average policy entropy across trajectories.
- Component 2, entropy smoothing: L_smooth(θ) penalizes deviations from historical entropy bounds.
- Component 3, adaptive weighting: a dynamic phase-based coefficient β_k balances exploration and exploitation over training.
- EPO loss: L_EPO(θ) = L_MT(θ) - λ[L_H(θ) - β_k L_smooth(θ)], i.e. the multi-turn loss combined with entropy regularization and the smoothing penalty.
- Training loop: collect trajectories, update the entropy history, optimize the EPO loss.
- Evaluation: ScienceWorld and ALFWorld, both IID and OOD; up to 152% improvement with stable training dynamics and better convergence.
- Key insight: standard entropy regularization methods fail in multi-turn settings.
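Adaptive phase-based weighting could be as simple as a schedule on β_k that tightens the smoothing term as training progresses: a small coefficient early on leaves room for exploration, while a larger one later enforces the historical entropy bounds more strictly. The linear ramp and the endpoint values below are hypothetical choices, not the paper's schedule.

```python
def adaptive_beta(step, total_steps, beta_min=0.1, beta_max=1.0):
    """Phase-based weighting (sketch): ramp the smoothing coefficient
    beta_k linearly from beta_min (early, exploration-friendly) to
    beta_max (late, exploitation-friendly) over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return beta_min + (beta_max - beta_min) * frac
```

Any monotone schedule (cosine, piecewise phases) would serve the same role; the point is that the exploration-exploitation balance is tied to the training phase rather than held fixed.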
Q1
1. What is the main challenge that EPO addresses in multi-turn LLM agent training?
High computational costs of training
Exploration-exploitation cascade failure
Memory limitations in language models
Q2
2. In the experimental results, what was the most significant performance improvement achieved by EPO?
19.8% improvement on ALFWorld
152% improvement on ScienceWorld
50% improvement on both benchmarks
Q3
3. Which component of EPO helps prevent uncontrolled entropy growth in early training stages?
Adaptive phase-based weighting
Multi-turn entropy regularization
Entropy smoothing regularizer with historical bounds