2025-06-26 Papers


Paper 1

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Published: 2025-06-25

Link: http://arxiv.org/pdf/2506.20512

1. 📘 Topic and Domain: The paper explores mid-training strategies for improving reinforcement learning (RL) performance in language models, specifically focusing on mathematical reasoning capabilities.
2. 💡 Previous Research and New Ideas: Based on previous research showing divergent RL performance between Llama and Qwen models, the paper proposes a novel two-stage mid-training strategy called "stable-then-decay" to enhance Llama's RL compatibility.
3. ❓ Problem: The paper addresses why different base language models (like Llama and Qwen) show varying behaviors during RL training, particularly for reasoning tasks, and how to make Llama more suitable for RL scaling.
4. 🛠️ Methods: The authors implemented a two-stage mid-training approach: first training models on 200B tokens with constant learning rate, then training on 20B tokens across three Chain-of-Thought focused branches with learning rate decay, followed by RL training.
5. 📊 Results and Evaluation: The resulting OctoThinker models showed 10-20% improvement over original base models and matched Qwen2.5's performance across 13 mathematical benchmarks, effectively closing the performance gap between Llama and more RL-friendly model families.
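The "stable-then-decay" schedule in point 4 amounts to a constant learning rate for the long stable stage followed by a cosine ramp-down during the branch stage. A minimal sketch, with placeholder step counts and rate values rather than the paper's hyperparameters:

```python
import math

def stable_then_decay_lr(step, stable_steps, decay_steps,
                         peak_lr=3e-4, min_lr=3e-5):
    """Two-stage mid-training LR schedule (illustrative values only)."""
    if step < stable_steps:
        # Stage 1 (stable): constant learning rate throughout.
        return peak_lr
    # Stage 2 (decay): cosine decay from peak_lr down to min_lr.
    progress = min((step - stable_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In the paper's setup the stable stage covers 200B tokens and each decay branch another 20B; here those budgets are abstracted into step counts.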

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling (overview diagram)

Observation: Llama and Qwen exhibit a gap in RL training behavior.

Controlled mid-training factors: math web corpora, QA-format data, instruction following, and training budget.

Two-stage "stable-then-decay" strategy (200B + 20B tokens):
- Stage 1, stable (200B tokens): MegaMath-Web-Pro-Max (72.5%), DCLM baseline (10%), synthetic data (17.5%) → OctoThinker-Base-Stable
- Stage 2, decay (20B tokens, cosine LR decay) across three branches, each with 30% QA data:
  - Short branch (MegaMath-QA, OpenMathInstruct2, NuminaMath1.5) → OctoThinker-Short
  - Long branch (OpenR1-Math, AM-DeepSeek-Distilled-40M) → OctoThinker-Long
  - Hybrid branch (mixed short and long CoT data) → OctoThinker-Hybrid

RL training: GRPO algorithm with a progressive length scheduler, a complex template, and stabilization techniques, on the MATH8K dataset.

OctoThinker-Zero family:
- OctoThinker-Short-Zero: fast, concise reasoning
- OctoThinker-Long-Zero: deep, detailed reasoning
- OctoThinker-Hybrid-Zero: balanced reasoning
Performance matches Qwen2.5 at the same scale.

Key findings:
- High-quality math corpora (MegaMath-Web-Pro) are crucial for RL success
- QA-format data improves RL, but distribution alignment matters
- Instruction data unlocks the potential of QA data
- Scaling the mid-training budget consistently improves RL performance
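The RL stage uses GRPO, whose core step is scoring each sampled response against its own group: the advantage is the reward normalized by the group's mean and standard deviation. A minimal sketch of that advantage computation (not the paper's implementation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward by the
    mean and (population) std of its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # All responses scored the same: no learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Because the baseline comes from the group itself, no separate value network is needed, which is what makes GRPO attractive for RL on reasoning tasks.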
Q1. What was the key observation that motivated the authors to investigate mid-training strategies for different language model families?
- Qwen models showed stable RL training with reasonable response-length increases, while Llama models exhibited abnormal behavior, with responses reaching 4,096 tokens and becoming repetitive
- Llama models consistently outperformed Qwen models in mathematical reasoning but failed in other domains
- Both Qwen and Llama models showed identical RL training dynamics but differed in their pre-training data quality

Q2. In the OctoThinker two-stage mid-training strategy, what happens during the decay stage?
- The model is trained on general web data with increasing learning rates to improve stability
- Training branches into three variants (Long, Short, Hybrid) with different CoT data mixtures and a decayed learning rate over 20B tokens
- The model undergoes reinforcement learning training directly, without any additional pre-training data

Q3. Why did the authors name their model family OctoThinker?
- Because the model was trained on exactly 8 different datasets, representing the 8 arms of an octopus
- "Octo" evokes the multi-armed octopus, reflecting the multiple training branches, while "Thinker" reflects the final RL stage where models learn to think and reason with self-reflection
- It was named after the 8th version of their experimental framework, which finally achieved success

Paper 2

Unified Vision-Language-Action Model

Published: 2025-06-24

Link: http://arxiv.org/pdf/2506.19850

1. 📘 Topic and Domain: A unified vision-language-action (VLA) model for robotic manipulation that integrates vision, language, and action modalities into a single framework.
2. 💡 Previous Research and New Ideas: Based on previous VLA models that used separate encoders for vision and relied on language-centric paradigms; proposes a novel unified approach that represents all modalities as discrete tokens within a shared vocabulary.
3. ❓ Problem: Existing VLA models have limited cross-modal integration and struggle to capture temporal/causal dependencies in robot actions due to treating modalities separately.
4. 🛠️ Methods: Transforms vision, language and action signals into discrete tokens, uses an autoregressive transformer to model them as an interleaved sequence, and incorporates world model training on video data.
5. 📊 Results and Evaluation: Achieved state-of-the-art results across multiple benchmarks, including CALVIN (4.63 average task length), LIBERO (95.5% success rate), and SimplerEnv-Bridge (69.8% success rate), significantly outperforming previous methods.
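The interleaved sequence design in point 4 can be illustrated with a small helper that wraps each modality chunk in boundary tokens before concatenation. The string markers and chunk structure here are simplifications for illustration; the real model operates on integer token ids from a shared vocabulary:

```python
# Boundary markers for vision and action chunks.
BOI, EOI = "<boi>", "<eoi>"   # begin/end of image
BOA, EOA = "<boa>", "<eoa>"   # begin/end of action

def build_policy_sequence(text_tokens, vision_chunks, action_chunks):
    """Interleave per-timestep vision and action chunks after the task
    text, mirroring the {L_t, L_v^1, L_a^1, L_v^2, L_a^2, ...} layout."""
    seq = list(text_tokens)
    for v, a in zip(vision_chunks, action_chunks):
        seq += [BOI, *v, EOI, BOA, *a, EOA]
    return seq
```

Training the autoregressive transformer on such sequences (with the loss masked to action tokens during policy fine-tuning) is what lets one model capture the temporal and causal structure across modalities.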

Unified Vision-Language-Action Model (workflow diagram)

UniVLA workflow:
- Pre-training: vision-language alignment (Emu3) yields a unified multimodal model over text tokens, vision tokens (VQ encoder), and action tokens (DCT/FAST), modeled by an 8.5B-parameter autoregressive transformer.
- Post-training, world model: 622K robot videos (large-scale); sequence {L_t, L_v^1, L_v^2, ..., L_v^T}; loss on vision tokens only; learns temporal dynamics and causality.
- Fine-tuning, policy learning: task-specific datasets; interleaved vision-action sequence {L_t, L_v^1, L_a^1, L_v^2, L_a^2, ...}; loss on action tokens only.
- Evaluation benchmarks: CALVIN (4.63 avg length), LIBERO (95.5%), SimplerEnv (69.8%), real robot, autonomous driving.

Key features and capabilities: unified token representation for all modalities; autoregressive sequence modeling; world-model learning from large-scale videos; multimodal outputs (vision, language, action); causal temporal-dynamics modeling; state-of-the-art performance across benchmarks.

Technical innovations: DCT action encoding, VQ vision tokenization, interleaved sequence design, two-stage training, and special tokens (<boi>, <eoi>, <boa>, <eoa>) marking modality boundaries.
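The DCT action encoding listed among the technical innovations rests on a standard signal-processing fact: smooth action trajectories concentrate their energy in low-frequency DCT coefficients, so truncating the spectrum compresses them. A naive DCT-II sketch (illustrative, not the paper's FAST tokenizer):

```python
import math

def dct_ii(signal):
    """Naive O(n^2) DCT-II: coefficient k measures how much of the
    signal oscillates at frequency k; k = 0 is the (scaled) mean."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(signal))
            for k in range(n)]

def compress_action(trajectory, keep):
    """Keep only the first `keep` low-frequency coefficients; for
    smooth trajectories these carry most of the signal energy."""
    return dct_ii(trajectory)[:keep]
```

In practice the retained coefficients would then be quantized into discrete action tokens; that quantization step is omitted here.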
Q1. What is the key innovation that distinguishes UniVLA from previous vision-language-action models?
- It uses a larger transformer with 8.5 billion parameters
- It represents vision, language, and action as discrete tokens within a unified autoregressive framework
- It only focuses on long-horizon robotic manipulation tasks

Q2. According to the experimental results, what was UniVLA's performance improvement on the LIBERO benchmark compared to the previous state of the art, π0-FAST?
- Improved from 85.5% to 95.5% average success rate
- Improved from 69.0% to 94.0% on long-horizon tasks only
- Achieved exactly the same performance as π0-FAST

Q3. What post-training strategy did the authors find most effective for enhancing downstream policy learning?
- Action-prediction training on robotic demonstration data
- Text-to-image generation training on static image datasets
- World-model training that captures video dynamics from large-scale video data

Paper 3

Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset

Published: 2025-06-23

Link: http://arxiv.org/pdf/2506.18851

1. 📘 Topic and Domain: The paper focuses on developing a large-scale cross-pair dataset called Phantom-Data for subject-consistent video generation in computer vision and AI.
2. 💡 Previous Research and New Ideas: Based on existing in-pair training approaches and face-based datasets, it proposes a novel cross-pair dataset that spans diverse subject categories beyond just faces and includes varied contexts.
3. ❓ Problem: The paper addresses the "copy-paste problem" in subject-to-video generation, where models struggle to follow textual instructions while maintaining subject identity across different contexts.
4. 🛠️ Methods: The authors develop a three-stage pipeline: S2V Detection for subject identification, Contextually Diverse Retrieval from 53M videos and 3B images, and Prior-Based Identity Verification to ensure consistency.
5. 📊 Results and Evaluation: The approach achieved superior performance in text-video alignment and overall video quality while maintaining subject consistency, with their method receiving 76% preference in user studies compared to baselines under 12%.
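The retrieval stage in point 4 keeps candidates that are similar enough to the query subject to plausibly share its identity, but not so similar that they come from the same context (which would reintroduce copy-paste behavior). A minimal sketch of such bounded-similarity filtering, with illustrative thresholds (the paper tunes its own bounds):

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve_cross_context(query_emb, bank, lower=0.5, upper=0.9):
    """Keep candidates within [lower, upper] similarity: above `lower`
    to preserve identity, below `upper` to force a different context."""
    return [cid for cid, emb in bank.items()
            if lower <= cosine(query_emb, emb) <= upper]
```

The embeddings would come from the paper's face encoder (ArcFace) or general CLIP-based encoder depending on subject category; here they are plain vectors.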

Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset (pipeline diagram)

Phantom-Data construction pipeline:
- Stage 1, S2V detection: frame sampling (t = 0.05, 0.5, 0.95); keyword extraction (Qwen2.5); visual grounding (Qwen2.5-VL); bounding-box filtering (4%-90% frame coverage); visual-semantic recheck.
- Stage 2, contextually diverse retrieval: a large-scale retrieval bank of 53M videos and 3B images; a face encoder (ArcFace) and a general CLIP-based encoder; query-based retrieval with upper and lower similarity bounds; cross-context candidate selection.
- Stage 3, prior-based identity verification: prior-knowledge filtering (logo checks for products, same-video constraint for living subjects); VLM-based verification of identity consistency and context diversity.
- Output: a cross-pair dataset of ~1M identity-consistent pairs.

Data sources: Koala-36M videos, internal repositories, LAION-3B images, with scene segmentation and quality filtering.

Subject categories: humans (face and body), animals, products, environments, multi-subject scenes.

Quality checks: completeness, specificity, subject-text matching, identity consistency, context diversity.

Final dataset: 1M cross-pair samples, 30K+ multi-subject, general-purpose, publicly available, cross-context diverse.

Key innovation, cross-pair training: reference subjects come from different contexts than the target video, which reduces the copy-paste problem while maintaining identity consistency and improves text alignment and visual quality.

Evaluation results: better text alignment, improved visual quality, maintained identity consistency, 76% user preference.

Technical implementation: Phantom-wan model (1.3B params), Rectified Flow training, 64 A100 GPUs, 30K iterations, 480p resolution.

Applications: personalized advertising, AI-driven filmmaking, digital content creation, educational media.

Problem solved (copy-paste issue): traditional in-pair training copies background and context along with the subject; cross-pair training preserves identity while enabling new contexts.
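Stage 1's frame sampling and bounding-box filtering from the pipeline above can be sketched directly; the helper names are hypothetical, and only the sampling fractions and coverage bounds come from the source:

```python
def sample_frames(num_frames, fractions=(0.05, 0.5, 0.95)):
    """Pick frame indices at fixed relative positions in the clip,
    mirroring the t = 0.05 / 0.5 / 0.95 sampling in Stage 1."""
    return [round(f * (num_frames - 1)) for f in fractions]

def bbox_passes(bbox, frame_w, frame_h, lo=0.04, hi=0.90):
    """Keep a detection only if its box covers 4%-90% of the frame,
    filtering out tiny specks and near-full-frame boxes."""
    x0, y0, x1, y1 = bbox
    coverage = ((x1 - x0) * (y1 - y0)) / (frame_w * frame_h)
    return lo <= coverage <= hi
```

Boxes surviving this filter would then go through the visual-semantic recheck before entering the retrieval stage.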
Q1. What is the main problem that Phantom-Data aims to solve in subject-to-video generation?
- The copy-paste problem, where models replicate reference subjects without following textual instructions
- Low video resolution and poor frame rate in generated videos
- Inability to generate videos longer than 10 seconds

Q2. How large is the retrieval bank used in Phantom-Data's contextually diverse retrieval stage?
- Over 10 million videos and 1 billion images
- Over 53 million videos and 3 billion images
- Over 100 million videos and 5 billion images

Q3. In the user study comparing different training approaches, what percentage of votes did the Phantom-Data method receive?
- 56%
- 65%
- 76%