2025-06-26 Papers


Paper 1

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Published: 2025-06-25

Link: http://arxiv.org/pdf/2506.20512

1. 📘 Topic and Domain: The paper explores mid-training strategies for improving reinforcement learning (RL) performance in language models, specifically focusing on mathematical reasoning capabilities.
2. 💡 Previous Research and New Ideas: Based on previous research showing divergent RL performance between Llama and Qwen models, the paper proposes a novel two-stage mid-training strategy called "stable-then-decay" to enhance Llama's RL compatibility.
3. ❓ Problem: The paper addresses why different base language models (like Llama and Qwen) show varying behaviors during RL training, particularly for reasoning tasks, and how to make Llama more suitable for RL scaling.
4. 🛠️ Methods: The authors implemented a two-stage mid-training approach: first training models on 200B tokens with constant learning rate, then training on 20B tokens across three Chain-of-Thought focused branches with learning rate decay, followed by RL training.
5. 📊 Results and Evaluation: The resulting OctoThinker models showed 10-20% improvement over original base models and matched Qwen2.5's performance across 13 mathematical benchmarks, effectively closing the performance gap between Llama and more RL-friendly model families.
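The "stable-then-decay" schedule in point 4 amounts to a constant learning rate for the long stable stage followed by a cosine ramp-down during the branch stage. A minimal sketch, with placeholder step counts and rate values rather than the paper's hyperparameters:

```python
import math

def stable_then_decay_lr(step, stable_steps, decay_steps,
                         peak_lr=3e-4, min_lr=3e-5):
    """Two-stage mid-training LR schedule (illustrative values only)."""
    if step < stable_steps:
        # Stage 1 (stable): constant learning rate throughout.
        return peak_lr
    # Stage 2 (decay): cosine decay from peak_lr down to min_lr.
    progress = min((step - stable_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In the paper's setup the stable stage covers 200B tokens and each decay branch another 20B; here those budgets are abstracted into step counts.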

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling (overview diagram)

Observation: Llama and Qwen exhibit a gap in RL training behavior.

Controlled mid-training factors: math web corpora, QA-format data, instruction following, and training budget.

Two-stage "stable-then-decay" strategy (200B + 20B tokens):
- Stage 1, stable (200B tokens): MegaMath-Web-Pro-Max (72.5%), DCLM baseline (10%), synthetic data (17.5%) → OctoThinker-Base-Stable
- Stage 2, decay (20B tokens, cosine LR decay) across three branches, each with 30% QA data:
  - Short branch (MegaMath-QA, OpenMathInstruct2, NuminaMath1.5) → OctoThinker-Short
  - Long branch (OpenR1-Math, AM-DeepSeek-Distilled-40M) → OctoThinker-Long
  - Hybrid branch (mixed short and long CoT data) → OctoThinker-Hybrid

RL training: GRPO algorithm with a progressive length scheduler, a complex template, and stabilization techniques, on the MATH8K dataset.

OctoThinker-Zero family:
- OctoThinker-Short-Zero: fast, concise reasoning
- OctoThinker-Long-Zero: deep, detailed reasoning
- OctoThinker-Hybrid-Zero: balanced reasoning
Performance matches Qwen2.5 at the same scale.

Key findings:
- High-quality math corpora (MegaMath-Web-Pro) are crucial for RL success
- QA-format data improves RL, but distribution alignment matters
- Instruction data unlocks the potential of QA data
- Scaling the mid-training budget consistently improves RL performance
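The RL stage uses GRPO, whose core step is scoring each sampled response against its own group: the advantage is the reward normalized by the group's mean and standard deviation. A minimal sketch of that advantage computation (not the paper's implementation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward by the
    mean and (population) std of its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # All responses scored the same: no learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Because the baseline comes from the group itself, no separate value network is needed, which is what makes GRPO attractive for RL on reasoning tasks.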
Q1. What was the key observation that motivated the authors to investigate mid-training strategies for different language model families?
- Qwen models showed stable RL training with reasonable response-length increases, while Llama models exhibited abnormal behavior, with responses reaching 4,096 tokens and becoming repetitive
- Llama models consistently outperformed Qwen models in mathematical reasoning but failed in other domains
- Both Qwen and Llama models showed identical RL training dynamics but differed in their pre-training data quality

Q2. In the OctoThinker two-stage mid-training strategy, what happens during the decay stage?
- The model is trained on general web data with increasing learning rates to improve stability
- Training branches into three variants (Long, Short, Hybrid) with different CoT data mixtures and a decayed learning rate over 20B tokens
- The model undergoes reinforcement learning training directly, without any additional pre-training data

Q3. Why did the authors name their model family OctoThinker?
- Because the model was trained on exactly 8 different datasets, representing the 8 arms of an octopus
- "Octo" evokes the multi-armed octopus, reflecting the multiple training branches, while "Thinker" reflects the final RL stage where models learn to think and reason with self-reflection
- It was named after the 8th version of their experimental framework, which finally achieved success

Paper 2

Unified Vision-Language-Action Model

Published: 2025-06-24

Link: http://arxiv.org/pdf/2506.19850

1. 📘 Topic and Domain: A unified vision-language-action (VLA) model for robotic manipulation that integrates vision, language, and action modalities into a single framework.
2. 💡 Previous Research and New Ideas: Based on previous VLA models that used separate encoders for vision and relied on language-centric paradigms; proposes a novel unified approach that represents all modalities as discrete tokens within a shared vocabulary.
3. ❓ Problem: Existing VLA models have limited cross-modal integration and struggle to capture temporal/causal dependencies in robot actions due to treating modalities separately.
4. 🛠️ Methods: Transforms vision, language and action signals into discrete tokens, uses an autoregressive transformer to model them as an interleaved sequence, and incorporates world model training on video data.
5. 📊 Results and Evaluation: Achieved state-of-the-art results across multiple benchmarks, including CALVIN (4.63 average task length), LIBERO (95.5% success rate), and SimplerEnv-Bridge (69.8% success rate), significantly outperforming previous methods.
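The interleaved sequence design in point 4 can be illustrated with a small helper that wraps each modality chunk in boundary tokens before concatenation. The string markers and chunk structure here are simplifications for illustration; the real model operates on integer token ids from a shared vocabulary:

```python
# Boundary markers for vision and action chunks.
BOI, EOI = "<boi>", "<eoi>"   # begin/end of image
BOA, EOA = "<boa>", "<eoa>"   # begin/end of action

def build_policy_sequence(text_tokens, vision_chunks, action_chunks):
    """Interleave per-timestep vision and action chunks after the task
    text, mirroring the {L_t, L_v^1, L_a^1, L_v^2, L_a^2, ...} layout."""
    seq = list(text_tokens)
    for v, a in zip(vision_chunks, action_chunks):
        seq += [BOI, *v, EOI, BOA, *a, EOA]
    return seq
```

Training the autoregressive transformer on such sequences (with the loss masked to action tokens during policy fine-tuning) is what lets one model capture the temporal and causal structure across modalities.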

Unified Vision-Language-Action Model (workflow diagram)

UniVLA workflow:
- Pre-training: vision-language alignment (Emu3) yields a unified multimodal model over text tokens, vision tokens (VQ encoder), and action tokens (DCT/FAST), modeled by an 8.5B-parameter autoregressive transformer.
- Post-training, world model: 622K robot videos (large-scale); sequence {L_t, L_v^1, L_v^2, ..., L_v^T}; loss on vision tokens only; learns temporal dynamics and causality.
- Fine-tuning, policy learning: task-specific datasets; interleaved vision-action sequence {L_t, L_v^1, L_a^1, L_v^2, L_a^2, ...}; loss on action tokens only.
- Evaluation benchmarks: CALVIN (4.63 avg length), LIBERO (95.5%), SimplerEnv (69.8%), real robot, autonomous driving.

Key features and capabilities: unified token representation for all modalities; autoregressive sequence modeling; world-model learning from large-scale videos; multimodal outputs (vision, language, action); causal temporal-dynamics modeling; state-of-the-art performance across benchmarks.

Technical innovations: DCT action encoding, VQ vision tokenization, interleaved sequence design, two-stage training, and special tokens (<boi>, <eoi>, <boa>, <eoa>) marking modality boundaries.
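The DCT action encoding listed among the technical innovations rests on a standard signal-processing fact: smooth action trajectories concentrate their energy in low-frequency DCT coefficients, so truncating the spectrum compresses them. A naive DCT-II sketch (illustrative, not the paper's FAST tokenizer):

```python
import math

def dct_ii(signal):
    """Naive O(n^2) DCT-II: coefficient k measures how much of the
    signal oscillates at frequency k; k = 0 is the (scaled) mean."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(signal))
            for k in range(n)]

def compress_action(trajectory, keep):
    """Keep only the first `keep` low-frequency coefficients; for
    smooth trajectories these carry most of the signal energy."""
    return dct_ii(trajectory)[:keep]
```

In practice the retained coefficients would then be quantized into discrete action tokens; that quantization step is omitted here.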
Q1. What is the key innovation that distinguishes UniVLA from previous vision-language-action models?
- It uses a larger transformer with 8.5 billion parameters
- It represents vision, language, and action as discrete tokens within a unified autoregressive framework
- It only focuses on long-horizon robotic manipulation tasks

Q2. According to the experimental results, what was UniVLA's performance improvement on the LIBERO benchmark compared to the previous state of the art, π0-FAST?
- Improved from 85.5% to 95.5% average success rate
- Improved from 69.0% to 94.0% on long-horizon tasks only
- Achieved exactly the same performance as π0-FAST

Q3. What post-training strategy did the authors find most effective for enhancing downstream policy learning?
- Action-prediction training on robotic demonstration data
- Text-to-image generation training on static image datasets
- World-model training that captures video dynamics from large-scale video data

Paper 3

Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset

Published: 2025-06-23

Link: http://arxiv.org/pdf/2506.18851

1. 📘 Topic and Domain: The paper focuses on developing a large-scale cross-pair dataset called Phantom-Data for subject-consistent video generation in computer vision and AI.
2. 💡 Previous Research and New Ideas: Based on existing in-pair training approaches and face-based datasets, it proposes a novel cross-pair dataset that spans diverse subject categories beyond just faces and includes varied contexts.
3. ❓ Problem: The paper addresses the "copy-paste problem" in subject-to-video generation, where models struggle to follow textual instructions while maintaining subject identity across different contexts.
4. 🛠️ Methods: The authors develop a three-stage pipeline: S2V Detection for subject identification, Contextually Diverse Retrieval from 53M videos and 3B images, and Prior-Based Identity Verification to ensure consistency.
5. 📊 Results and Evaluation: The approach achieved superior performance in text-video alignment and overall video quality while maintaining subject consistency, with their method receiving 76% preference in user studies compared to baselines under 12%.
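The retrieval stage in point 4 keeps candidates that are similar enough to the query subject to plausibly share its identity, but not so similar that they come from the same context (which would reintroduce copy-paste behavior). A minimal sketch of such bounded-similarity filtering, with illustrative thresholds (the paper tunes its own bounds):

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve_cross_context(query_emb, bank, lower=0.5, upper=0.9):
    """Keep candidates within [lower, upper] similarity: above `lower`
    to preserve identity, below `upper` to force a different context."""
    return [cid for cid, emb in bank.items()
            if lower <= cosine(query_emb, emb) <= upper]
```

The embeddings would come from the paper's face encoder (ArcFace) or general CLIP-based encoder depending on subject category; here they are plain vectors.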

Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset (pipeline diagram)

Phantom-Data construction pipeline:
- Stage 1, S2V detection: frame sampling (t = 0.05, 0.5, 0.95); keyword extraction (Qwen2.5); visual grounding (Qwen2.5-VL); bounding-box filtering (4%-90% frame coverage); visual-semantic recheck.
- Stage 2, contextually diverse retrieval: a large-scale retrieval bank of 53M videos and 3B images; a face encoder (ArcFace) and a general CLIP-based encoder; query-based retrieval with upper and lower similarity bounds; cross-context candidate selection.
- Stage 3, prior-based identity verification: prior-knowledge filtering (logo checks for products, same-video constraint for living subjects); VLM-based verification of identity consistency and context diversity.
- Output: a cross-pair dataset of ~1M identity-consistent pairs.

Data sources: Koala-36M videos, internal repositories, LAION-3B images, with scene segmentation and quality filtering.

Subject categories: humans (face and body), animals, products, environments, multi-subject scenes.

Quality checks: completeness, specificity, subject-text matching, identity consistency, context diversity.

Final dataset: 1M cross-pair samples, 30K+ multi-subject, general-purpose, publicly available, cross-context diverse.

Key innovation, cross-pair training: reference subjects come from different contexts than the target video, which reduces the copy-paste problem while maintaining identity consistency and improves text alignment and visual quality.

Evaluation results: better text alignment, improved visual quality, maintained identity consistency, 76% user preference.

Technical implementation: Phantom-wan model (1.3B params), Rectified Flow training, 64 A100 GPUs, 30K iterations, 480p resolution.

Applications: personalized advertising, AI-driven filmmaking, digital content creation, educational media.

Problem solved (copy-paste issue): traditional in-pair training copies background and context along with the subject; cross-pair training preserves identity while enabling new contexts.
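Stage 1's frame sampling and bounding-box filtering from the pipeline above can be sketched directly; the helper names are hypothetical, and only the sampling fractions and coverage bounds come from the source:

```python
def sample_frames(num_frames, fractions=(0.05, 0.5, 0.95)):
    """Pick frame indices at fixed relative positions in the clip,
    mirroring the t = 0.05 / 0.5 / 0.95 sampling in Stage 1."""
    return [round(f * (num_frames - 1)) for f in fractions]

def bbox_passes(bbox, frame_w, frame_h, lo=0.04, hi=0.90):
    """Keep a detection only if its box covers 4%-90% of the frame,
    filtering out tiny specks and near-full-frame boxes."""
    x0, y0, x1, y1 = bbox
    coverage = ((x1 - x0) * (y1 - y0)) / (frame_w * frame_h)
    return lo <= coverage <= hi
```

Boxes surviving this filter would then go through the visual-semantic recheck before entering the retrieval stage.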
Q1. What is the main problem that Phantom-Data aims to solve in subject-to-video generation?
- The copy-paste problem, where models replicate reference subjects without following textual instructions
- Low video resolution and poor frame rate in generated videos
- Inability to generate videos longer than 10 seconds

Q2. How large is the retrieval bank used in Phantom-Data's contextually diverse retrieval stage?
- Over 10 million videos and 1 billion images
- Over 53 million videos and 3 billion images
- Over 100 million videos and 5 billion images

Q3. In the user study comparing different training approaches, what percentage of votes did the Phantom-Data method receive?
- 56%
- 65%
- 76%