2025-09-29 Papers


Paper 1

LongLive: Real-time Interactive Long Video Generation

Published: 2025-09-26

Link: http://arxiv.org/pdf/2509.22622

1. 📘 Topic and Domain: Real-time interactive long video generation using frame-level autoregressive models for AI-powered video content creation.
2. 💡 Previous Research and New Ideas: Built upon diffusion models and autoregressive video generation, introducing new techniques like KV-recache, streaming long tuning, and short window attention with frame sink.
3. ❓ Problem: Addressing the challenges of generating high-quality long videos efficiently while enabling real-time interactive control through prompt switching.
4. 🛠️ Methods: Implemented KV-recache to refresh cached states during prompt switches, streaming long tuning for train-long-test-long alignment, and short window attention with frame sink for faster generation.
5. 📊 Results and Evaluation: Achieved 20.7 FPS on a single NVIDIA H100 GPU, supported up to 240-second video generation, and outperformed baselines on VBench benchmarks while requiring only 32 GPU-days for training.
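The KV-recache idea can be sketched in a few lines: at a prompt switch, rather than discarding the cache or keeping states tied to the old prompt, the keys and values for the retained frames are recomputed under the new prompt. The additive fusion of frame features with the prompt embedding, and all shapes and names below, are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def kv_recache(frame_feats, new_prompt_emb, w_k, w_v):
    """At a prompt switch, rebuild the KV cache from the retained frame
    features conditioned on the NEW prompt embedding, so attention over
    past frames reflects the updated prompt instead of stale states."""
    fused = frame_feats + new_prompt_emb  # hypothetical fusion: additive prompt conditioning
    k = fused @ w_k                       # refreshed keys for the cached frames
    v = fused @ w_v                       # refreshed values for the cached frames
    return k, v
```

This preserves the visual history (the frame features) while re-deriving the attention states, which is what lets generation follow the new prompt without a visible discontinuity.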


Overview (reconstructed from the paper's workflow diagram):

- Framework: frame-level autoregressive generation with causal attention and KV caching; 1.3B parameters; 832×480 at 16 FPS; up to 240 seconds, driven in real time by sequential user prompts.
- KV re-cache: refreshes cached states with each new prompt, ensuring smooth prompt transitions, visual consistency, and semantic adherence across switches.
- Streaming long tuning: train-long, test-long alignment via self-supervised training on long sequences; reduces error accumulation; 32 GPU-days of training.
- Efficient long inference: short window attention plus a frame-level attention sink, cutting compute by 28% and memory by 17%.
- Training pipeline: Wan2.1-T2V-1.3B base model, DMD self-forcing short-clip adaptation, then long-sequence training (60 s with prompt switches) via LoRA fine-tuning (rank 256, 27% of parameters).
- Inference pipeline: sequential user prompts, causal frame-by-frame rollout, KV re-cache at switch points, real-time output at 20.7 FPS.
- Quality: VBench 84.87; long video 83.52; interactive 84.38; strong consistency and smooth transitions.
- Efficiency and scalability: 20.7 FPS on a single H100 (41× faster than SkyReels-V2), INT8 support, multiple prompt switches, O(W+T+S) memory complexity.
- Applications: interactive storytelling, creative content, real-time control, educational videos, cinematic production.
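The short-window-attention-with-frame-sink pattern can be sketched as a boolean mask over frame positions: every query frame attends to a handful of global "sink" frames at the start of the video plus a sliding window of recent frames, so the number of attended keys stays O(W + S) instead of growing with the video length. Window size and sink count below are placeholders.

```python
import numpy as np

def sink_window_mask(n_frames, window, n_sink):
    """Causal attention mask combining a frame sink with a short window:
    frame q attends to the first `n_sink` frames (global anchors) plus
    the most recent `window` frames ending at q."""
    mask = np.zeros((n_frames, n_frames), dtype=bool)
    for q in range(n_frames):
        mask[q, :min(n_sink, q + 1)] = True           # global sink frames
        mask[q, max(0, q - window + 1):q + 1] = True  # local sliding window
    return mask
```

For a 10-frame rollout with `window=3` and `n_sink=2`, the last frame attends only to frames 0, 1, 7, 8, 9, which is where the compute and memory savings come from.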
Q1
1. What is the main innovation of KV-recache in LongLive compared to traditional KV caching?
It completely discards all previous cached states
It refreshes cached states by combining previous videos with new prompt embeddings
It stores cached states permanently without any updates
Q2
2. What is the maximum video length that LONGLIVE can generate on a single H100 GPU?
60 seconds
120 seconds
240 seconds
Q3
3. How does streaming long tuning improve the model's performance?
By using larger batch sizes during training
By training only on short video clips
By aligning training and inference conditions through long sequence generation

Paper 2

Quantile Advantage Estimation for Entropy-Safe Reasoning

Published: 2025-09-26

Link: http://arxiv.org/pdf/2509.22611

1. 📘 Topic and Domain: Reinforcement learning for large language models, specifically focusing on entropy control in language model reasoning tasks.
2. 💡 Previous Research and New Ideas: Based on value-free RL methods like GRPO and DAPO, proposes a new Quantile Advantage Estimation (QAE) approach that replaces mean-baseline with group-wise K-quantile baseline.
3. ❓ Problem: Addresses the dual challenge of preventing both entropy collapse (premature convergence) and entropy explosion (uncontrolled exploration) in LLM reinforcement learning.
4. 🛠️ Methods: Implements a K-quantile baseline that creates a two-regime gate: reinforcing rare successes on hard queries and targeting remaining failures on easy queries, with theoretical guarantees for entropy safety.
5. 📊 Results and Evaluation: Achieved sustained pass@1 gains on Qwen3-8B/14B-Base across AIME'24/'25 and AMC'23 benchmarks, with roughly 80% of responses receiving zero advantage, demonstrating more efficient credit assignment.
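The K-quantile baseline can be sketched in a few lines; `np.quantile` stands in for the paper's group-wise quantile, and the two-regime gate falls out of the reward distribution. On a hard query (mostly failures), the quantile baseline sits at 0, so the rare success gets a positive advantage while failures get zero; on an easy query (mostly successes), it sits at 1, so the remaining failures get a negative advantage while successes get zero. This is a sketch, not the paper's exact estimator.

```python
import numpy as np

def qae_advantage(rewards, k=0.5, eps=1e-8):
    """Quantile Advantage Estimation (sketch): replace the group mean
    baseline with the group's K-quantile, then normalize by the reward
    standard deviation (eps added for numerical safety)."""
    r = np.asarray(rewards, dtype=float)
    baseline = np.quantile(r, k)        # group-wise K-quantile baseline
    return (r - baseline) / (r.std() + eps)
```

With binary rewards, most responses in a group match the baseline exactly and receive zero advantage, which is consistent with the reported ~80% sparsity.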


Overview (reconstructed from the paper's method diagram):

- Problem: the mean baseline in RLVR induces both entropy collapse and entropy explosion.
- QAE: replace the mean baseline with a group-wise K-quantile baseline.
- Hard queries (success rate p ≤ 1−K): baseline is 0, so rare successes are reinforced.
- Easy queries (p > 1−K): baseline is 1, so the remaining failures are targeted.
- Advantage estimate: Â_i = (R_i - Q_K({R_j})) / (std({R_j}) + ε).
- Theoretical guarantee: two-sided entropy safety under first-order updates.
- Implementation: a drop-in, one-line change with a single parameter K, compatible with existing algorithms (DAPO, GSPO, CLIP-COV, KL-COV) and model sizes (Qwen3-8B/14B/30B).
- Sparsity effect: about 80% of responses receive zero advantage, focusing updates on informative samples.
- Entropy control: prevents both explosion and collapse, yielding stable training and better sample efficiency.
- Results: consistent pass@1 gains on AIME'24/'25 and AMC'23, sustained without plateau across model sizes.
- Key innovation: baseline design as an entropy control mechanism.
Q1
1. What is the main innovation of QAE compared to previous methods like GRPO and DAPO?
It introduces a new token-level clipping mechanism
It replaces the mean baseline with a K-quantile baseline
It adds a new entropy regularization term
Q2
2. What interesting empirical observation did the authors make about QAE's efficiency?
It reduced training time by 50%
It required twice the computing resources
About 80% of responses received zero advantage
Q3
3. How does QAE handle different difficulty levels of queries?
It treats all queries the same way regardless of difficulty
It reinforces rare successes on hard queries and targets remaining failures on easy ones
It only focuses on easy queries to maximize performance

Paper 3

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Published: 2025-09-26

Link: http://arxiv.org/pdf/2509.22576

1. 📘 Topic and Domain: Training large language model (LLM) agents in multi-turn environments using reinforcement learning, focusing on entropy-regularized policy optimization.
2. 💡 Previous Research and New Ideas: Based on traditional reinforcement learning approaches like PPO and GRPO; proposes a new framework called EPO that introduces entropy smoothing regularization and adaptive phase-based weighting.
3. ❓ Problem: Addresses the "exploration-exploitation cascade failure" in multi-turn environments with sparse rewards, where agents either commit to flawed strategies too early or engage in chaotic exploration that destabilizes training.
4. 🛠️ Methods: Implements three mechanisms: entropy regularization across multi-turn settings, entropy smoothing regularizer to prevent abrupt fluctuations, and adaptive phase-based weighting to balance exploration and exploitation throughout training.
5. 📊 Results and Evaluation: Achieved up to 152% performance improvement on ScienceWorld and 19.8% on ALFWorld benchmarks compared to baselines, with significantly more stable training dynamics and better generalization to unseen tasks.
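The entropy smoothing regularizer might look roughly like this: penalize the current policy entropy only when it drifts outside a band around recent history, which damps both sudden collapse and sudden explosion. The quadratic penalty, the running-mean bound, and the tolerance width are assumptions for illustration; the paper specifies only that deviations from historical entropy bounds are penalized.

```python
import numpy as np

def entropy_smoothing_penalty(entropy, history, tol=0.2):
    """Entropy smoothing (sketch): zero penalty while the current
    entropy stays within `tol` of the running mean of recent entropies,
    quadratic penalty for abrupt deviations outside that band."""
    if not history:
        return 0.0                          # no history yet: no constraint
    mu = float(np.mean(history))            # historical entropy level
    dev = abs(entropy - mu)
    return max(0.0, dev - tol) ** 2         # penalize only outside the band
```

In a training loop this would be evaluated each update against a sliding window of past entropies, so the bound itself adapts as the policy evolves.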


Overview (reconstructed from the paper's workflow diagram):

- Problem: exploration-exploitation cascade failure in multi-turn environments (30+ turns per episode, sparse rewards); baselines are standard RL methods (PPO, GRPO).
- Component 1, multi-turn entropy regularization: L_H(θ) is the average policy entropy across trajectories.
- Component 2, entropy smoothing: L_smooth(θ) penalizes deviations from historical entropy bounds.
- Component 3, adaptive weighting: a dynamic phase-based coefficient β_k balances exploration and exploitation over training.
- EPO loss: L_EPO(θ) = L_MT(θ) - λ[L_H(θ) - β_k L_smooth(θ)], i.e. the multi-turn loss combined with entropy regularization and the smoothing penalty.
- Training loop: collect trajectories, update the entropy history, optimize the EPO loss.
- Evaluation: ScienceWorld and ALFWorld, both IID and OOD; up to 152% improvement with stable training dynamics and better convergence.
- Key insight: standard entropy regularization methods fail in multi-turn settings.
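Adaptive phase-based weighting could be as simple as a schedule on β_k that tightens the smoothing term as training progresses: a small coefficient early on leaves room for exploration, while a larger one later enforces the historical entropy bounds more strictly. The linear ramp and the endpoint values below are hypothetical choices, not the paper's schedule.

```python
def adaptive_beta(step, total_steps, beta_min=0.1, beta_max=1.0):
    """Phase-based weighting (sketch): ramp the smoothing coefficient
    beta_k linearly from beta_min (early, exploration-friendly) to
    beta_max (late, exploitation-friendly) over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return beta_min + (beta_max - beta_min) * frac
```

Any monotone schedule (cosine, piecewise phases) would serve the same role; the point is that the exploration-exploitation balance is tied to the training phase rather than held fixed.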
Q1
1. What is the main challenge that EPO addresses in multi-turn LLM agent training?
High computational costs of training
Exploration-exploitation cascade failure
Memory limitations in language models
Q2
2. In the experimental results, what was the most significant performance improvement achieved by EPO?
19.8% improvement on ALFWorld
152% improvement on ScienceWorld
50% improvement on both benchmarks
Q3
3. Which component of EPO helps prevent uncontrolled entropy growth in early training stages?
Adaptive phase-based weighting
Multi-turn entropy regularization
Entropy smoothing regularizer with historical bounds