2025-09-24 Papers


Paper 1

Reinforcement Learning on Pre-Training Data

Published: 2025-09-23

Link: http://arxiv.org/pdf/2509.19249

1. 📘 Topic and Domain: A new training paradigm called Reinforcement Learning on Pre-Training Data (RLPT) for optimizing Large Language Models.
2. 💡 Previous Research and New Ideas: Builds on prior reinforcement-learning approaches such as RLHF and RLVR, which rely on costly human annotation; this paper instead proposes deriving reinforcement-learning rewards directly from pre-training data, with no human feedback.
3. ❓ Problem: The growing disparity between scalable computational resources and the finite supply of high-quality text data, which constrains conventional LLM training approaches.
4. 🛠️ Methods: Introduces a next-segment reasoning objective with two tasks, Autoregressive Segment Reasoning (ASR) and Middle Segment Reasoning (MSR), that rewards the model for accurately predicting subsequent text segments from context.
5. 📊 Results and Evaluation: When applied to Qwen3-4B-Base, RLPT achieved significant improvements across multiple benchmarks (3.0-8.1 points on general domain tasks and 5.3-6.6 points on mathematical reasoning tasks) with favorable scaling behavior.
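The reward-and-objective idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the paper uses a generative reward model to judge semantic consistency, and a simple exact-prefix check stands in for it here (an assumption).

```python
# Sketch of RLPT's binary segment reward and combined objective estimate.
# A generative reward model scores semantic consistency in the paper;
# a plain prefix-match check stands in for it here (an assumption).

def segment_reward(predicted: str, reference: str) -> int:
    """Return 1 if the prediction matches a prefix of the reference segment, else 0."""
    pred = predicted.strip()
    ref = reference.strip()
    return 1 if pred and ref.startswith(pred) else 0

def rlpt_objective(asr_rewards, msr_rewards, lam=0.5):
    """Monte-Carlo estimate of J_RLPT = E_ASR[r] + lam * E_MSR[r]."""
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(asr_rewards) + lam * mean(msr_rewards)

# Toy rollouts: two ASR samples, one MSR sample.
asr = [segment_reward("The cat sat", "The cat sat on the mat."),
       segment_reward("A dog ran", "The cat sat on the mat.")]
msr = [segment_reward("on the mat", "on the mat.")]
print(rlpt_objective(asr, msr, lam=0.5))  # 0.5 + 0.5*1.0 = 1.0
```

The binary 1/0 score mirrors the prefix reward strategy the paper describes; λ weights the MSR contribution against ASR.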


RLPT pipeline (reconstructed from the paper's overview figure):

Data preparation:
- Web text collection, deduplication, PII masking, quality filtering
- Sentence-level segmentation (NLTK toolkit): raw text → (s_<i, s_i, s_{i+1}) triples

Cold-start SFT:
- Initializes instruction-following capability
- Batch size 1024, LR 2×10⁻⁵, 3 epochs

Next-segment reasoning tasks:
- ASR (Autoregressive Segment Reasoning): predict s_i from s_<i; complete the next sentence given context, aligning with autoregressive generation
- MSR (Middle Segment Reasoning): predict s_i from (s_<i, s_{i+1}); fill masked content using bidirectional context

RL training:
- Policy π_θ, GRPO optimization, no KL regularization
- Batch size 512, 8 samples per prompt, temperature 1.0, LR 1×10⁻⁶

Generative reward model:
- Checks semantic consistency of the predicted vs. reference segment
- Prefix reward strategy: score 1 (match) / 0 (no match); a self-supervised reward signal

Objective:
- J_RLPT(θ) = E_ASR[r(o, s_i)] + λ·E_MSR[r(o, s_i)], with interleaved ASR and MSR tasks and λ ∈ (0, 1) balancing their contributions

Evaluation:
- General domain: MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, OlympiadBench (accuracy metric)
- Math reasoning: MATH-500, AMC23, Minerva Math, AIME24, AIME25 (Pass@k metric, n=64, temperature 0.6)

Extensions and scaling:
- RLPT as a foundation for RLVR training: additional 2.3–3.7 point improvements on AIME benchmarks
- Power-law scaling with training tokens; favorable scaling trend with potential for continued gains

Key benefits: no human annotation required, scalable on pre-training data, enhanced reasoning capabilities.
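The segmentation step above can be sketched as follows. The paper uses the NLTK toolkit for sentence splitting; a simple regex splitter stands in here so the example is self-contained (an assumption).

```python
import re

# Sketch of sentence-level segmentation into (s_<i, s_i, s_{i+1}) triples.
# The paper uses NLTK; a regex splitter stands in here for self-containment.

def split_sentences(text: str):
    """Naive sentence splitter on terminal punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

def make_triples(text: str):
    """Yield (context, target, next-segment) triples from raw text."""
    sents = split_sentences(text)
    triples = []
    for i in range(1, len(sents) - 1):
        context = " ".join(sents[:i])   # s_<i
        target = sents[i]               # s_i  (ASR predicts this from context)
        nxt = sents[i + 1]              # s_{i+1} (MSR additionally conditions on this)
        triples.append((context, target, nxt))
    return triples

doc = "RL scales compute. Data is finite. RLPT mines rewards from raw text."
for t in make_triples(doc):
    print(t)
```

ASR consumes only the first element of each triple as conditioning; MSR uses the first and third, treating the middle segment as masked content.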
Q1. What is the main innovation of RLPT compared to previous reinforcement learning approaches?
A) It uses human feedback in a more efficient way
B) It eliminates the need for human annotation by deriving rewards from pre-training data
C) It focuses only on mathematical reasoning tasks

Q2. Which of the following components is NOT one of the two main tasks in RLPT's next-segment reasoning objective?
A) Autoregressive Segment Reasoning (ASR)
B) Middle Segment Reasoning (MSR)
C) Terminal Segment Reasoning (TSR)

Q3. When RLPT was applied to Qwen3-4B-Base, which benchmark showed the highest absolute improvement?
A) MMLU (3.0 points)
B) GPQA-Diamond (8.1 points)
C) AIME24 (6.6 points)

Paper 2

Do You Need Proprioceptive States in Visuomotor Policies?

Published: 2025-09-23

Link: http://arxiv.org/pdf/2509.18644

1. 📘 Topic and Domain: Visuomotor policies for robotic manipulation, investigating whether proprioceptive state inputs are necessary for effective robot control.
2. 💡 Previous Research and New Ideas: Based on traditional imitation-learning visuomotor policies that use both visual and proprioceptive state inputs; proposes a novel "State-free Policy" that relies solely on visual inputs.
3. ❓ Problem: Addresses the limitation of state-based policies that overfit to training trajectories and show poor spatial generalization when manipulating objects in new positions.
4. 🛠️ Methods: Implements a State-free Policy using relative end-effector action space and dual wide-angle wrist cameras for full task observation, removing proprioceptive state inputs entirely.
5. 📊 Results and Evaluation: Achieved significantly improved spatial generalization (85% success in height generalization vs 0% with state input, 64% in horizontal generalization vs 6%), better data efficiency, and enhanced cross-embodiment adaptation across various robotic manipulation tasks.
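The relative end-effector action space named above can be sketched as a delta between consecutive EEF poses, so the policy never needs the absolute proprioceptive state. This is a minimal illustration with positions only and toy values (full poses would also use relative rotations), not the paper's implementation.

```python
import numpy as np

# Sketch of the relative EEF action space: actions are per-step pose deltas,
# removing any dependence on the absolute proprioceptive state.
# Positions only; values are toy data, not from the paper.

def absolute_to_relative(positions: np.ndarray) -> np.ndarray:
    """Convert absolute EEF positions (T, 3) to per-step delta actions (T-1, 3)."""
    return np.diff(positions, axis=0)

def rollout_from_relative(start: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Recover absolute positions from a start pose and relative actions."""
    return start + np.cumsum(deltas, axis=0)

traj = np.array([[0.00, 0.00, 0.10],
                 [0.00, 0.05, 0.10],
                 [0.02, 0.05, 0.15]])
deltas = absolute_to_relative(traj)
recovered = rollout_from_relative(traj[0], deltas)
print(np.allclose(recovered, traj[1:]))  # True
```

Because the deltas are frame-relative, the same action sequence transfers across start positions, which is consistent with the spatial-generalization argument in the summary.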


State-free visuomotor policy workflow (reconstructed from the paper's overview figure):

Problem identification:
- Proprioceptive state input causes overfitting to training trajectories and poor spatial generalization

Key conditions for removing state input:
1. Relative end-effector (EEF) action space (relative EEF works; absolute EEF and joint-angle spaces do not)
2. Full task observation, via dual wide-angle wrist cameras (120° × 120° FOV)

State-free policy:
- Remove proprioceptive states entirely; vision-only input
- Architecture-agnostic: validated with π₀, ACT, and Diffusion Policy

Evaluation:
- Task types: pick & place, shirt folding, whole-body manipulation
- Real-world tasks: Pick Pen, Pick Bottle, Put Lid, Fold Shirt, Fetch Bottle
- Multiple robot embodiments (dual-arm systems, whole-body robots); LIBERO benchmark in simulation

Key results:
- Spatial generalization: height 0% → 85% success; horizontal 6% → 64% success
- Maintained in-domain performance; fewer demonstrations needed (reduced overfitting)
- Better cross-embodiment transfer, with no state-space alignment required
- Consistent improvements across all architectures and tasks

Implementation details:
- Remove proprioceptive state input; use the relative EEF action space
- Deploy dual wide-angle wrist cameras to ensure full task observation
- Optional: remove the overhead camera

Future insights:
- Rethink sensor design: overhead cameras may be harmful, wrist cameras sufficient
- A foundation for generalizable robotic learning systems
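The "full task observation" condition amounts to the wrist cameras keeping the task-relevant points in view. A minimal sketch of such an FOV check, using a simplified symmetric pinhole model with the 120° × 120° field of view mentioned above (the geometry is an assumption, not the paper's calibration):

```python
import math

# Sketch of a field-of-view check for the "full task observation" condition:
# does a target point fall inside a wrist camera's 120° x 120° FOV?
# Simplified symmetric pinhole model (an assumption, not the paper's setup).

def in_fov(point_cam, fov_deg=120.0):
    """point_cam: (x, y, z) in the camera frame, z pointing forward."""
    x, y, z = point_cam
    if z <= 0:
        return False  # behind the camera
    half = math.radians(fov_deg / 2.0)
    # Angular offset from the optical axis in each image direction.
    return abs(math.atan2(x, z)) <= half and abs(math.atan2(y, z)) <= half

print(in_fov((0.1, 0.0, 0.3)))   # near the optical axis: True
print(in_fov((1.0, 0.0, 0.2)))   # ~79° off-axis, outside the 60° half-angle: False
```

A wide FOV is what lets the dual wrist cameras alone cover the workspace, so no overhead camera (or proprioceptive state) is needed.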
Q1. What was the key finding about the overhead camera in State-free Policies?
A) It was essential for successful task completion
B) It actually reduced performance in challenging scenarios
C) It had no impact on performance either way

Q2. Which action representation space proved most effective for State-free Policies?
A) Absolute joint-angle action space
B) Relative joint-angle action space
C) Relative end-effector action space

Q3. What was an unexpected benefit of State-free Policies?
A) They required more training data than state-based policies
B) They enabled better cross-embodiment adaptation between different robots
C) They only worked with simple manipulation tasks

Paper 3

VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

Published: 2025-09-23

Link: http://arxiv.org/pdf/2509.19297

1. 📘 Topic and Domain: Feed-forward 3D Gaussian Splatting for novel view synthesis using voxel-aligned prediction instead of traditional pixel-aligned approaches.
2. 💡 Previous Research and New Ideas: Based on previous pixel-aligned Gaussian Splatting methods, proposes a new voxel-aligned paradigm that predicts Gaussians from a 3D voxel grid rather than 2D pixels.
3. ❓ Problem: Addresses limitations of pixel-aligned methods including view-dependent density distributions, heavy reliance on input view numbers, and alignment errors in occluded or low-texture regions.
4. 🛠️ Methods: Uses a multi-view transformer for feature extraction, constructs 3D voxel features through unprojection, refines them with a sparse 3D U-Net, and predicts Gaussian parameters directly from the voxel grid.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on RealEstate10K and ScanNet datasets with higher PSNR/SSIM scores while using fewer Gaussians, demonstrating better geometric consistency and efficiency than pixel-aligned methods.
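The unprojection step in the methods above can be sketched as lifting per-pixel depths into 3D points via the camera intrinsics. The intrinsics and depth values below are illustrative, not from the paper, and the transform to world space (via the camera pose) is omitted for brevity.

```python
import numpy as np

# Sketch of the unprojection step used to build 3D voxel features:
# per-pixel depths are lifted into camera-space 3D points via intrinsics K.
# Values below are illustrative, not from the paper.

def unproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift a depth map (H, W) to camera-space 3D points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# Toy intrinsics for a 4x4 image with principal point at pixel (2, 2).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0,   0.0, 1.0]])
depth = np.full((4, 4), 2.0)
pts = unproject(depth, K)
print(pts.shape)   # (4, 4, 3)
print(pts[2, 2])   # point on the principal ray: [0. 0. 2.]
```

In the full pipeline these points would be mapped to world space with each camera pose before voxelization, so features from all views land in one shared grid.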


VolSplat workflow (reconstructed from the paper's overview figure):

Inputs:
- Multi-view images {I₁, I₂, ..., Iₙ} with camera poses {P₁, P₂, ..., Pₙ}

2D feature extraction:
- ResNet backbone with cross-view attention
- Cost volume via plane sweeping and feature matching

Depth prediction:
- Depth module producing per-view depth maps

3D feature construction:
- Unproject 2D features to world space, then voxelize
- Feature aggregation: V_{i,j,k} = avg(features)

Feature refinement:
- Sparse 3D U-Net with residual learning: V' = V + R(V)
- Multi-scale fusion

Voxel-aligned Gaussian prediction:
- Per-voxel Gaussians {μ, α, Σ, c} with adaptive density

Rendering and supervision:
- 3D Gaussian Splatting for novel-view rendering and view-consistent 3D reconstruction
- Loss: L = L_MSE + λ·L_LPIPS (photometric + perceptual)

Key innovation (voxel-aligned vs. pixel-aligned): multi-view consistency, adaptive density, reduced alignment errors.
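The per-voxel averaging V_{i,j,k} = avg(features) can be sketched as binning world-space points into a grid and averaging their features. Grid dimensions and voxel size below are illustrative, and the subsequent sparse 3D U-Net refinement (V' = V + R(V)) is omitted.

```python
import numpy as np

# Sketch of voxel feature aggregation V_{i,j,k} = avg(features): world-space
# points carrying per-point features are binned into a voxel grid and averaged.
# Grid size and voxel size are illustrative; the paper's sparse 3D U-Net
# refinement (V' = V + R(V)) is omitted here.

def voxelize(points, feats, voxel_size=0.5, grid=(4, 4, 4)):
    """Average point features per voxel. points: (N, 3), feats: (N, C)."""
    idx = np.clip((points / voxel_size).astype(int), 0, np.array(grid) - 1)
    V = np.zeros(grid + (feats.shape[1],))
    counts = np.zeros(grid)
    # Unbuffered scatter-add so repeated voxel indices accumulate correctly.
    np.add.at(V, (idx[:, 0], idx[:, 1], idx[:, 2]), feats)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    occupied = counts > 0
    V[occupied] /= counts[occupied][:, None]
    return V, occupied

# Two points share voxel (0,0,0); one lands in voxel (3,0,0).
pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.1, 0.1], [1.6, 0.1, 0.1]])
feats = np.array([[1.0], [3.0], [5.0]])
V, occ = voxelize(pts, feats)
print(V[0, 0, 0], V[3, 0, 0])  # [2.] [5.]
```

Predicting Gaussians from occupied voxels rather than pixels is what decouples the number of Gaussians from the number of input views, matching the adaptive-density claim in the summary.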
Q1. What is the key innovation of VolSplat compared to previous feed-forward 3D Gaussian Splatting methods?
A) It uses more input camera views
B) It predicts Gaussians from a 3D voxel grid instead of 2D pixels
C) It has a larger neural network architecture

Q2. According to the experimental results, what advantage does VolSplat demonstrate over pixel-aligned methods when handling scene complexity?
A) It requires more Gaussians to represent scenes
B) It only works well for simple scenes
C) It adaptively controls Gaussian density based on scene complexity

Q3. When tested on the ACID dataset without fine-tuning (cross-dataset generalization), what characteristic did VolSplat demonstrate?
A) It completely failed to generalize
B) It showed higher sensitivity to domain shifts
C) It maintained significantly better performance than pixel-aligned methods