2026-02-13 Papers

1/2

Paper 1

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Published: 2026-02-12

Link: http://arxiv.org/pdf/2602.12036

1. 📘 Topic and Domain: The paper focuses on reinforcement learning for large language models, specifically developing a method called Composition-RL to better utilize verifiable prompts in mathematical reasoning tasks.

2. 💡 Previous Research and New Ideas: The paper builds on Reinforcement Learning with Verifiable Rewards (RLVR) and proposes Composition-RL, which automatically composes multiple existing problems into new verifiable questions to address the issue of "solve all" prompts that become uninformative during training.

3. ❓ Problem: The paper aims to solve the problem of diminishing effective training data in RLVR, where easy prompts with 100% success rates provide zero gradient signals and reduce the useful dataset size during training.

4. 🛠️ Methods: The authors use Sequential Prompt Composition (SPC) to combine K existing prompts into compositional prompts, then apply Group Relative Policy Optimization (GRPO) for RL training on these composed prompts with curriculum learning across different compositional depths.

5. 📊 Results and Evaluation: Composition-RL consistently outperforms standard RL training across 4B-30B models on mathematical benchmarks (AIME, Beyond AIME, IMOBench) and multi-task reasoning (GPQA, MMLU-Pro), with improvements ranging from +3.3% to +10.5% overall, and demonstrates effective cross-domain capabilities when composing physics and math problems.

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

1/2

Paper 2

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Published: 2026-02-12

Link: http://arxiv.org/pdf/2602.12205

1. 📘 Topic and Domain: This paper presents DeepGen 1.0, a lightweight unified multimodal model for image generation and editing in the computer vision domain.

2. 💡 Previous Research and New Ideas: The paper builds on existing VLM-DiT (Vision-Language Model - Diffusion Transformer) architectures but proposes Stacked Channel Bridging (SCB) with learnable "think tokens" to better align VLM and DiT representations across multiple layers rather than relying on single final layers.

3. ❓ Problem: The paper aims to solve the problem that current high-performing unified multimodal models require massive parameter scales (>10B) with prohibitive training costs, while smaller models consistently underperform across diverse generation and editing tasks.

4. 🛠️ Methods: The authors use a 5B parameter architecture (3B VLM + 2B DiT) with Stacked Channel Bridging for feature fusion, and employ a three-stage training strategy: alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO using mixture rewards and auxiliary supervision.

5. 📊 Results and Evaluation: DeepGen 1.0 achieves competitive or superior performance compared to models 3-16× larger, outperforming HunyuanImage (80B) by 28% on WISE and Qwen-Image-Edit (27B) by 37% on UniREditBench, while being trained on only ~50M samples compared to billions used by competitors.

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

1/2

Paper 3

The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

Published: 2026-02-10

Link: http://arxiv.org/pdf/2602.09877

1. 📘 Topic and Domain: The paper examines the safety dynamics of self-evolving multi-agent AI systems built from large language models, focusing on the fundamental impossibility of maintaining safety in closed-loop agent societies.

2. 💡 Previous Research and New Ideas: The paper builds on existing multi-agent systems research (CAMEL, MetaGPT, Smallville) and proposes a novel information-theoretic framework demonstrating that self-evolving, isolated, and safe AI societies represent an impossible trilemma.

3. ❓ Problem: The paper aims to solve the problem of understanding why self-evolving AI agent societies inevitably experience safety degradation when operating in isolation without external human oversight.

4. 🛠️ Methods: The authors use information theory and thermodynamics to formalize safety as KL divergence from anthropic value distributions, analyze the Moltbook agent community qualitatively, and conduct quantitative experiments on RL-based and memory-based self-evolving systems.

5. 📊 Results and Evaluation: Results show three failure modes (cognitive degeneration, alignment failure, communication collapse) in Moltbook, progressive safety degradation in both experimental paradigms measured by increased jailbreak success rates and decreased truthfulness scores, confirming theoretical predictions of inevitable safety erosion.