2025-06-24 Papers


Paper 1

Light of Normals: Unified Feature Representation for Universal Photometric Stereo

Published: 2025-06-23

Link: http://arxiv.org/pdf/2506.18882

1. 📘 Topic and Domain: Universal photometric stereo - a computer vision technique for reconstructing 3D surface normals from multiple images captured under varying lighting conditions.
2. 💡 Previous Research and New Ideas: Based on previous encoder-decoder approaches like UniPS and SDM-UniPS; introduces new light register tokens and wavelet transforms to better decouple lighting from surface features.
3. ❓ Problem: Addresses two key challenges: 1) Decoupling illumination variations from surface normal features, and 2) Preserving high-frequency geometric details in complex surfaces.
4. 🛠️ Methods: Proposes the LINO-UniPS architecture, which combines light register tokens and global cross-image attention for lighting-normal decoupling, a wavelet transform for high-frequency detail preservation, and a normal-gradient confidence loss; also introduces the PS-Verse training dataset with graded geometric complexity.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on public benchmarks (DiLiGenT, LUCES), with improved feature consistency (higher CSIM/SSIM scores) and better normal reconstruction accuracy compared to existing methods, especially for complex geometries.
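The wavelet-aware downsampling in item 4 is easiest to see with a single-level 2D Haar transform. The pure-Python sketch below is illustrative only (not the paper's implementation); it shows why the detail sub-bands retain high-frequency information that plain bilinear averaging would discard:

```python
def haar_dwt2(img):
    """Single-level 2D Haar DWT of an H x W grid (H, W even).

    Returns four half-resolution sub-bands: LL (low-pass average) plus
    LH/HL/HH detail bands, so the 2x downsampling loses no information,
    unlike plain bilinear averaging."""
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 4.0  # approximation
            lh[i // 2][j // 2] = (a - b + c - d) / 4.0  # horizontal detail
            hl[i // 2][j // 2] = (a + b - c - d) / 4.0  # vertical detail
            hh[i // 2][j // 2] = (a - b - c + d) / 4.0  # diagonal detail
    return ll, lh, hl, hh

# A sharp edge inside one 2x2 block: the average (LL) blurs it to 0.5,
# but the LH detail band records it, so the inverse transform (IDWT,
# as in the paper's WaveUpSampler) can restore it exactly.
ll, lh, hl, hh = haar_dwt2([[0, 1],
                            [0, 1]])
# ll == [[0.5]], lh == [[-0.5]]
```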

LINO-UniPS workflow (figure summary):
• Pipeline: multi-light images (B×F×H×W×3) → light-registered wavelet-aware downsampler (DWT + bilinear) → DINOv2 backbone feature extraction with light register tokens (HDRI, point, area) → enhanced light-normal contextual attention (4 interleaved frame-axis / light-axis / global blocks) → light aligner (cosine-similarity loss) → DPT-based fusion module (multi-level aggregation) → WaveUpSampler (IDWT + fusion, H×W×C) → pixel-sampling transformer decoder (similar to SDM-UniPS) → surface normals (H×W×3), trained with a normal-gradient perception loss using the confidence map C = e^G̃.
• PS-Verse dataset: 100K scenes with graduated complexity, 17,805 textured 3D models, normal maps for fine details, diverse lighting conditions.
• Key innovations: light register tokens for decoupling; wavelet-transform detail preservation; global cross-image attention; normal-gradient confidence loss.
• Performance: SOTA on DiLiGenT and LUCES; higher CSIM/SSIM scores; better feature consistency; superior generalization.
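The normal-gradient confidence idea (C = e^G̃) can be sketched as a per-pixel weight on the reconstruction loss. The forward-difference gradient, single channel, and L1 form below are simplifications for illustration, not the paper's exact loss:

```python
import math

def gradient_confidence(normals):
    """Confidence map C = exp(G) from the gradient magnitude G of a
    scalar (H x W) normal component; pixels near sharp geometric edges
    get exponentially larger weight."""
    h, w = len(normals), len(normals[0])
    conf = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            gx = normals[i][min(j + 1, w - 1)] - normals[i][j]
            gy = normals[min(i + 1, h - 1)][j] - normals[i][j]
            conf[i][j] = math.exp(math.hypot(gx, gy))  # C = e^G
    return conf

def confidence_weighted_l1(pred, gt, conf):
    """Mean L1 error, up-weighted near high-gradient (edge) pixels."""
    h, w = len(gt), len(gt[0])
    total = sum(conf[i][j] * abs(pred[i][j] - gt[i][j])
                for i in range(h) for j in range(w))
    return total / sum(conf[i][j] for i in range(h) for j in range(w))
```

On a flat region the weights collapse to e^0 = 1 and the loss reduces to a plain mean L1; near an edge the errors count more, which is the point of the confidence term.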
Q1. What are the two fundamental challenges that LINO-UniPS aims to address in universal photometric stereo?
A. Low computational efficiency and limited dataset size
B. Deep coupling between illumination and surface normal features, and preservation of high-frequency geometric details
C. Camera calibration errors and insufficient lighting conditions
Q2. What innovative technique does LINO-UniPS use to preserve high-frequency surface details during feature processing?
A. Wavelet transform-based sampling instead of traditional bilinear interpolation
B. Multiple convolutional layers with residual connections
C. Gaussian blur filters applied to input images
Q3. How many complexity levels does the PS-Verse dataset contain, and what makes Level 5 unique?
A. 4 levels total, with Level 4 featuring the most complex lighting conditions
B. 6 levels total, with Level 5 containing only metallic materials
C. 5 levels total, with Level 5 being the first to use normal maps for enhanced surface details

Paper 2

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

Published: 2025-06-23

Link: http://arxiv.org/pdf/2506.18841

1. 📘 Topic and Domain: Ultra-long text generation with large language models via reinforcement learning, in the domain of natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous approaches like LongWriter that used supervised fine-tuning on synthetic data, this paper proposes a novel incentivization-based approach using reinforcement learning without relying on annotated or synthetic data.
3. ❓ Problem: The paper aims to solve the challenges of ultra-long text generation, including maximum length limits and quality degradation as sequence length increases in large language models.
4. 🛠️ Methods: The authors use Group Relative Policy Optimization (GRPO) for RL training, with specialized reward models targeting length control, writing quality, and structural formatting, combined with continual pretraining and a "think" prompting strategy.
5. 📊 Results and Evaluation: LongWriter-Zero, trained from Qwen2.5-32B, outperformed traditional SFT methods and achieved state-of-the-art results on WritingBench and Arena-Write benchmarks, surpassing even 100B+ models like DeepSeek R1 and Qwen3-235B.
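GRPO's core trick, critic-free group-relative advantages, can be sketched in a few lines. This is a minimal sketch of the general algorithm's advantage computation, not the authors' training code:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each of the G responses
    sampled for the same prompt is scored against its own group's mean
    and std, so no learned critic/value model is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:  # degenerate group: identical rewards carry no signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Four completions of one prompt, scored by the reward models:
adv = grpo_advantages([0.2, 0.4, 0.6, 0.8])
```

Normalizing within the group is also what lets the length, quality, and format rewards be balanced on the advantage level rather than by hand-tuned weights.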

LongWriter-Zero overview (figure summary):
• Pipeline: Qwen2.5-32B base → continual pretraining on 30B tokens (long books and articles, 1% CoT data) → RL training (GRPO, "think" prompt, 150 RL steps) → LongWriter-Zero (SOTA performance).
• RQ1 (reward design): length reward via target-range matching, writing quality and coherence reward, format reward for structure and consistency, balanced through advantages.
• RQ2 (test-time scaling): a <think> planning stage before the <answer> outperforms direct output without thinking.
• RQ3 (impact analysis, Arena Elo): base without think ≈ 700, base with think ≈ 1200, continual pretraining with think ≈ 1400; continual pretraining raises the performance ceiling.
• Training setup: queries from WildChat-1M and LMSYS-Chat-1M; GRPO with group advantages and normalized rewards; max 14K tokens, T=0.8, top-p=1.0, 8 nodes with 8×H800 each.
• Key results: WritingBench 8.69 (best overall score), Arena-Write Elo 1447 (highest); RL outperforms SFT and beats larger 100B+ models.
• Key innovation: first RL-only approach for ultra-long text generation; no synthetic-data dependency; multi-reward balancing; test-time scaling.
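The length reward targets a range rather than an exact count. A hypothetical shaping function (the paper's exact form may differ; the linear decay here is an assumption) might look like:

```python
def length_reward(n_tokens, lo, hi):
    """Hypothetical target-range length reward: full reward inside
    [lo, hi], decaying linearly to 0 as the output drifts away from the
    range. The exact shaping used in the paper may differ."""
    if lo <= n_tokens <= hi:
        return 1.0
    if n_tokens < lo:
        return max(0.0, n_tokens / lo)               # too short
    return max(0.0, 1.0 - (n_tokens - hi) / hi)      # too long

# e.g. a query asking for roughly 4K-8K tokens:
scores = [length_reward(n, 4000, 8000) for n in (2000, 5000, 12000)]
# -> [0.5, 1.0, 0.5]
```

A range-based reward avoids penalizing every deviation from a single target count, which would make the reward needlessly noisy for open-ended writing queries.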
Q1. What is the key innovation of LongWriter-Zero compared to previous approaches like LongWriter?
A. It uses reinforcement learning from scratch without relying on synthetic or annotated data
B. It increases the maximum token length from 10K to 50K tokens
C. It combines multiple language models to generate longer texts
Q2. Which three components does the paper identify as critical for maximizing RL effectiveness in long-form generation?
A. Data augmentation, model scaling, and hardware optimization
B. Reward design, test-time scaling, and continual pretraining
C. Prompt engineering, fine-tuning, and ensemble methods
Q3. What surprising result did LongWriter-Zero achieve despite having only 32B parameters?
A. It matched the performance of GPT-4 on mathematical reasoning tasks
B. It outperformed 100B+ models like DeepSeek R1 and Qwen3-235B on long-form writing benchmarks
C. It reduced training time by 90% compared to traditional supervised fine-tuning

Paper 3

RLPR: Extrapolating RLVR to General Domains without Verifiers

Published: 2025-06-22

Link: http://arxiv.org/pdf/2506.18254

1. 📘 Topic and Domain: Reinforcement learning for language models, specifically extending RLVR (Reinforcement Learning with Verifiable Rewards) to general domains beyond mathematics and code.
2. 💡 Previous Research and New Ideas: Builds on RLVR, which relies on domain-specific verifiers for reward signals; proposes using the LLM's intrinsic probability of generating the correct answer as the reward signal instead of external verifiers.
3. ❓ Problem: RLVR's reliance on domain-specific verifiers limits its scalability and application to general domains, as creating verifiers for diverse natural language tasks is prohibitively complex.
4. 🛠️ Methods: Introduces RLPR framework that uses token probabilities of reference answers as rewards, implements reward debiasing to remove question/answer biases, and applies standard deviation filtering to stabilize training.
5. 📊 Results and Evaluation: Achieved consistent improvements across both mathematical and general reasoning tasks, surpassing verifier-based methods by 1.6 points on average across seven benchmarks and outperforming concurrent verifier-free approaches by 7.6 points on TheoremQA.
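The probability reward and debiasing steps in item 4 reduce to a few lines; the sketch below follows the formulas in the paper's diagram, but the numeric values are illustrative, not from the paper:

```python
def prob_reward(token_probs):
    """RLPR reward r = (1/|y*|) * sum(p_i): the mean probability the
    policy assigns to each token of the reference answer y*. Using the
    mean (not the product / sequence likelihood) keeps the reward robust
    to answer length and minor token variations."""
    return sum(token_probs) / len(token_probs)

def debias(r, r_direct):
    """r_hat = clip(r - r', 0, 1): subtract the reward r' obtained when
    the model answers directly without reasoning, removing bias coming
    from the question and reference answer themselves."""
    return min(1.0, max(0.0, r - r_direct))

# Illustrative numbers:
r = prob_reward([0.9, 0.8, 0.7])    # scored with the sampled reasoning
r0 = prob_reward([0.5, 0.4, 0.3])   # direct answer, no reasoning
r_hat = debias(r, r0)
```

The subtraction rewards only the lift that the sampled reasoning itself contributes, not how easy the question already was.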

RLPR methodology (figure summary):
• Problem and insight: RLVR is limited to verifiable domains; an LLM's intrinsic probability of the reference answer indicates reasoning quality, enabling a verifier-free framework for general domains.
• Probability reward (PR): r = (1/|y*|) Σ p_i, the mean token probability of the reference answer y*.
• Reward debiasing: r̂ = clip(r - r', 0, 1), removing bias from the question and reference answer (r' is the reward of a direct answer without reasoning).
• Std-dev filtering: an adaptive curriculum that dynamically removes low-variance prompts.
• RL training with GRPO: ∇J_RLPR(θ) = E[r̂ ∇ log π_θ(o|x)], optimizing the expected probability reward.
• Results: general domains MMLU-Pro 56.0, TheoremQA 55.4; math domains MATH-500 78.0, Minerva 56.5; outperforms General-Reasoner.
• Key components: mean token probabilities rather than sequence likelihood (robust to length variation and synonyms); debiasing computed as an advantage over a direct no-reasoning answer; adaptive filtering with an exponential-moving-average threshold; domain-agnostic, no external verifiers needed.
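The adaptive std-dev filter can be sketched with an exponential-moving-average threshold; the alpha value and exact update rule below are assumptions for illustration:

```python
import statistics

class StdFilter:
    """Sketch of RLPR's adaptive prompt filter: keep a prompt only when
    the std of its group's rewards exceeds a threshold tracked as an
    exponential moving average of observed stds. (alpha and the update
    rule here are assumptions, not the paper's exact settings.)"""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.threshold = 0.0

    def keep(self, group_rewards):
        std = statistics.pstdev(group_rewards)
        decision = std >= self.threshold  # low-variance groups give
                                          # near-zero GRPO advantages
        # Fold the newly observed std into the EMA threshold.
        self.threshold = (1 - self.alpha) * self.threshold + self.alpha * std
        return decision

f = StdFilter()
first = f.keep([0.2, 0.8])   # spread-out rewards: informative, kept
second = f.keep([0.5, 0.5])  # identical rewards: filtered out
```

Because the threshold tracks recent variance, the filter tightens as training stabilizes, acting as an implicit curriculum over prompts.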
Q1. What is the primary limitation that prevents existing RLVR methods from being applied to general domains?
A. Heavy reliance on domain-specific verifiers that are complex to create for natural language tasks
B. Insufficient computational resources for training on diverse datasets
C. Lack of high-quality training data in general domains
Q2. In RLPR's probability-based reward calculation, why does the paper use mean token probabilities (r = (1/|y*|) Σ p_i) instead of the normalized sequence likelihood (product of probabilities)?
A. Mean probabilities require less computational overhead during training
B. Sequence likelihood is overly sensitive to minor variations and introduces high variance
C. Mean probabilities work better with the GRPO algorithm's group normalization
Q3. How much improvement did RLPR achieve over the concurrent verifier-free method VeriFree on TheoremQA?
A. 1.6 points
B. 4.2 points
C. 7.6 points