2025-09-05 Papers

Paper 1

From Editor to Dense Geometry Estimator

Published: 2025-09-04

Link: http://arxiv.org/pdf/2509.04338

1. 📘 Topic and Domain: Dense geometry prediction (depth and normal estimation) from single images using image editing models.
2. 💡 Previous Research and New Ideas: Builds on prior work that repurposes text-to-image generative models for dense prediction; proposes instead fine-tuning image editing models, whose image-to-image formulation aligns better with dense prediction tasks.
3. ❓ Problem: Existing generative models lack inherent understanding of geometric cues from input images, leading to suboptimal performance in dense geometry estimation.
4. 🛠️ Methods: Adapts the Step1X-Edit model using consistent-velocity flow matching with a fixed start point, logarithmic quantization to preserve depth precision under BF16, and cost-free joint estimation of depth and normals through global attention.
5. 📊 Results and Evaluation: Achieves over a 35% performance improvement on the ETH3D dataset and outperforms the DepthAnything series (trained on roughly 100x more data) across multiple zero-shot depth and normal estimation benchmarks.
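The logarithmic quantization mentioned above (D_log = ln(D + 1e-6)) compresses the depth range so near-field values keep more relative precision under a low-bit float format; a minimal NumPy sketch of the idea (the round-trip helpers and the float32 cast standing in for BF16 are illustrative assumptions, not the paper's code):

```python
import numpy as np

EPS = 1e-6  # small offset, from the paper's D_log = ln(D + 1e-6)

def log_quantize(depth: np.ndarray) -> np.ndarray:
    """Map metric depth into log space so near-range values keep
    more relative precision under a low-bit float format."""
    return np.log(depth + EPS)

def log_dequantize(d_log: np.ndarray) -> np.ndarray:
    """Invert the mapping to recover metric depth."""
    return np.exp(d_log) - EPS

# Round-trip through a reduced-precision cast (float32 here for illustration).
depth = np.array([0.01, 0.5, 10.0, 80.0])
recovered = log_dequantize(log_quantize(depth).astype(np.float32))
print(np.allclose(recovered, depth, rtol=1e-5))  # → True
```

The log mapping spends the limited mantissa bits evenly across depth scales, which is why the ablation attributes a large gain to it.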

Method overview (from the paper's figure):

- Fine-tuning analysis: compares editor and generator starting points via feature evolution, training dynamics, and the resulting performance gap
- Three key adaptations:
  - Consistent velocity flow matching with a fixed start point (v = z₁ - z₀)
  - Logarithmic quantization for BF16 precision (D_log = ln(D + 1e-6))
  - Cost-free joint estimation of depth and normals via global attention
- Architecture: Step1X-Edit base (DiT backbone, VAE encoder/decoder), fine-tuned with LoRA
- Pipeline: input image x ∈ R^(H×W×3) → VAE encoder E(x) → z^x → DiT f_θ with flow matching v = f_θ(z^x) → VAE decoder D(ẑ^y) → ŷ (depth and normal predictions)
- Training data: Hypersim + Virtual KITTI; loss L = ||v_D - p_l||² + ||v_N - p_r||²
- Depth results: 35% improvement on ETH3D (AbsRel 3.8), 10% improvement on KITTI
- Normal results: SoTA performance on 4 benchmarks (MeanErr 13.8 on ScanNet)
- Data efficiency: 71K training images vs. 62.6M for DepthAnything (about 0.1% of the data)
- Ablations: consistent velocity +7%, log quantization +19%, joint training +5%
- Key findings: editing models provide a better foundation than generators for dense prediction; consistent velocity training improves stability and reduces inference errors; joint estimation with global attention enables mutual enhancement at no extra cost
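The consistent-velocity target v = z₁ - z₀ is simply the constant velocity of a straight-line path from the start latent to the target latent; a toy sketch (z0, z1, and the zero start point are made-up stand-ins for the paper's latents):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for latents: z0 is a fixed start point (the paper fixes it
# rather than sampling fresh noise), z1 is the target geometry latent.
z0 = np.zeros(4)             # fixed start point (zeros chosen for illustration)
z1 = rng.normal(size=4)      # "clean" latent encoding depth/normals

def target_velocity(z0, z1):
    """Consistent-velocity flow matching target: one straight-line
    velocity v = z1 - z0, constant along the whole trajectory."""
    return z1 - z0

def interpolate(z0, z1, t):
    """Point on the straight path at time t in [0, 1]."""
    return (1 - t) * z0 + t * z1

# The path's velocity is the same at every t; verify by finite differences.
v = target_velocity(z0, z1)
for t in (0.0, 0.3, 0.9):
    h = 1e-6
    fd = (interpolate(z0, z1, t + h) - interpolate(z0, z1, t)) / h
    assert np.allclose(fd, v, atol=1e-4)
print("constant velocity along the path:", np.round(v, 3))
```

Because the regression target no longer depends on the timestep, training avoids conflicting velocity labels for the same input, which is the stability benefit the ablation reports.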
Q1. What is the main advantage of using image editing models over generative models for dense geometry prediction, according to the paper?
- They require less training data
- They have inherent structural priors and a better understanding of input images
- They are computationally more efficient

Q2. Which technical innovation did the authors introduce to handle the precision requirements of depth estimation?
- Increased model parameters
- Used FP32 precision
- Implemented logarithmic quantization

Q3. What remarkable achievement did FE2E demonstrate regarding data efficiency?
- It achieved similar results using 100x less data than DepthAnything
- It required no training data at all
- It needed twice as much data as previous methods

Paper 2

Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Published: 2025-09-03

Link: http://arxiv.org/pdf/2509.03867

1. 📘 Topic and Domain: The paper introduces "Drivelology" - the study of nonsensical yet meaningful language expressions - in the domain of natural language processing and linguistic analysis.
2. 💡 Previous Research and New Ideas: Builds on prior research on humor, sarcasm, and irony detection; proposes the novel concept of "nonsense with depth," which goes beyond simple semantic inversion or contradiction.
3. ❓ Problem: The paper aims to evaluate whether large language models can truly understand and reason about linguistically complex expressions that appear nonsensical but contain deeper meaning.
4. 🛠️ Methods: The authors created DRIVEL HUB - a multilingual dataset of 1,200 examples with expert annotations, and designed four tasks (detection, tagging, narrative writing, selection) to evaluate LLMs' comprehension abilities.
5. 📊 Results and Evaluation: The results showed that current LLMs struggle with understanding deeper semantic layers of Drivelology, with even top models achieving limited performance, especially on harder reasoning tasks requiring cultural context and pragmatic understanding.
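The detection and tagging tasks reduce to standard classification metrics; a small sketch of how accuracy and micro-F1 could be computed for them (the labels and examples below are invented for illustration, not taken from the dataset):

```python
# Minimal sketch of two of the paper's headline metrics: binary accuracy
# for the detection task and micro-averaged F1 for multi-label tagging.

def accuracy(preds, golds):
    """Fraction of detection decisions (Drivelology vs. not) that match."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def micro_f1(pred_tags, gold_tags):
    """Micro-averaged F1 over multi-label category tags
    (e.g. misdirection, paradox, switchbait, inversion, wordplay)."""
    tp = sum(len(p & g) for p, g in zip(pred_tags, gold_tags))
    fp = sum(len(p - g) for p, g in zip(pred_tags, gold_tags))
    fn = sum(len(g - p) for p, g in zip(pred_tags, gold_tags))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

preds, golds = [1, 1, 0, 1], [1, 0, 0, 1]
print(accuracy(preds, golds))  # → 0.75

pred_tags = [{"paradox"}, {"wordplay", "inversion"}]
gold_tags = [{"paradox", "misdirection"}, {"wordplay"}]
print(round(micro_f1(pred_tags, gold_tags), 3))  # → 0.667
```

The generative narrative-writing task cannot be scored this way, which is why the paper falls back on BERTScore and a GPT-4 judge there.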

Method overview (from the paper's figure):

- Data collection: social media platforms, 6 languages, 1,200 samples
- Drivelology categories ("nonsense with depth"): misdirection, paradox, switchbait, inversion, wordplay
- Annotation: 7 multilingual experts following a 4-step protocol with quality verification
- DRIVEL HUB dataset: 600 Drivelology + 600 non-Drivelology examples
- Four evaluation tasks:
  - Detection: binary classification, Drivelology vs. non-Drivelology
  - Tagging: multi-label classification over the 5 category types
  - Narrative writing: generative task explaining the implicit meaning
  - Selection: multiple-choice QA with 5 options, in easy and hard settings
- Zero-shot model evaluation: GPT-4, Claude-3, DeepSeek-V3, Qwen3, Llama3.1, etc., with prompting in English and Mandarin
- Metrics: accuracy (detection, MCQA), F1 (tagging), BERTScore and a GPT-4 judge (narrative writing)
- Key findings: DeepSeek-V3 performs best overall; there is a significant performance gap on hard MCQA; prompt language affects performance; model-size scaling benefits complex reasoning
- Analysis: model reasoning patterns, the impact of cultural knowledge, and human annotation challenges
- Conclusion: LLMs struggle with pragmatic understanding and need to move beyond statistical pattern matching
Q1. What makes Drivelology fundamentally different from traditional humor and sarcasm studies?
- It focuses on multilingual content
- It involves complex, non-linear narratives with deeper ambiguities
- It only studies internet language patterns

Q2. In the DRIVEL HUB dataset annotation process, what was particularly challenging?
- Finding enough multilingual annotators
- Converting text to digital format
- Reaching consensus on the subtle and subjective nature of implicit meanings

Q3. What was a key finding about LLMs' performance on Drivelology tasks?
- They performed perfectly on all tasks
- They struggled most with Hard narrative selection tasks requiring cultural context
- They only succeeded with English language examples

Paper 3

Towards a Unified View of Large Language Model Post-Training

Published: 2025-09-04

Link: http://arxiv.org/pdf/2509.04419

1. 📘 Topic and Domain: Theoretical unification of large language model post-training methods, specifically focusing on supervised fine-tuning (SFT) and reinforcement learning (RL) approaches in machine learning.
2. 💡 Previous Research and New Ideas: Based on existing SFT and RL post-training methods; proposes a novel unified theoretical framework showing these approaches are instances of a single optimization process rather than contradictory methods.
3. ❓ Problem: Addresses the lack of theoretical understanding of why SFT and RL can be effectively combined in LLM training, and aims to create a more efficient alternative to the resource-intensive sequential SFT-then-RL pipeline.
4. 🛠️ Methods: Introduces a Unified Policy Gradient Estimator (UPGE) that combines four components (stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient), and develops Hybrid Post-Training (HPT) algorithm that dynamically switches between SFT and RL based on performance feedback.
5. 📊 Results and Evaluation: HPT consistently outperformed baselines across six mathematical reasoning benchmarks and two out-of-distribution suites, achieving a 7-point gain over the strongest baseline on AIME 2024 using Qwen2.5-Math-7B, and showed substantial improvements on smaller models like Qwen2.5-Math-1.5B and Llama3.1-8B.
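HPT's switching rule can be sketched as follows; the hard 0/1 gate and the 0.5 threshold are illustrative assumptions (the paper defines the coefficients abstractly as α = f(P), β = g(P)):

```python
# Sketch of HPT's adaptive gate: performance feedback P, the mean rollout
# success rate, sets the mixing coefficients for the combined loss
# L = alpha * L_RL + beta * L_SFT.

def performance_feedback(rollout_rewards):
    """P = (1/n) * sum(v(tau_i)): fraction of successful rollouts."""
    return sum(rollout_rewards) / len(rollout_rewards)

def gate(P, threshold=0.5):
    """Hard gate (illustrative): when the policy already solves the prompt
    often enough, explore with RL; otherwise exploit demonstrations via SFT."""
    if P >= threshold:
        return 1.0, 0.0   # (alpha: RL weight, beta: SFT weight)
    return 0.0, 1.0

def hybrid_loss(loss_rl, loss_sft, rollout_rewards):
    alpha, beta = gate(performance_feedback(rollout_rewards))
    return alpha * loss_rl + beta * loss_sft

# Policy fails most rollouts -> fall back to SFT on demonstrations.
print(hybrid_loss(loss_rl=2.0, loss_sft=1.0, rollout_rewards=[0, 0, 1, 0]))  # → 1.0
# Policy succeeds often -> switch to RL exploration.
print(hybrid_loss(loss_rl=2.0, loss_sft=1.0, rollout_rewards=[1, 1, 0, 1]))  # → 2.0
```

A smooth f and g would blend the two losses rather than hard-switch; the sketch only shows the feedback-driven decision that replaces the fixed SFT-then-RL schedule.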

Method overview (from the paper's figure):

- Unified Policy Gradient Estimator: grad_uni = 1_stable × (1/π_ref) × Â × ∇π_θ, assembled from a data source (online rollouts or offline demonstrations), a reference-policy denominator 1/π_ref, an advantage estimate Â (GAE or GRPO), and a stabilization mask 1_stable (clipping)
- Existing algorithms as instances of the estimator:
  - SFT: π_ref = π_θ, Â = 1, 1_stable = 1, offline data
  - PPO: π_ref = π_θold, Â = GAE, 1_stable = clip, online data
  - GRPO: π_ref = π_θold, Â = normalized, 1_stable = clip, online data
  - LUFFY: π_ref = 1, Â = normalized, 1_stable = 1, mixed data
  - HPT: dynamic π_ref and Â, adaptive gate driven by performance feedback
- Hybrid Post-Training (HPT): performance feedback P = (1/n) Σ v(τᵢ); dynamic coefficients α = f(P), β = g(P); mixed loss L = α·L_RL + β·L_SFT; a gate function switches between SFT (exploitation) and RL (exploration)
- Benefits: a unified theoretical framework, dynamic adaptation, and balanced exploration/exploitation
- Results: 7-point gain on AIME 2024, outperforms SFT→GRPO, enhanced Pass@k performance
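To make the unification concrete, here is a scalar sketch of the estimator grad_uni = 1_stable × (1/π_ref) × Â × ∇π_θ and its SFT instantiation; the function names and the scalar setting are illustrative assumptions, not the paper's code:

```python
import math

# Scalar sketch of the Unified Policy Gradient Estimator. The instantiation
# (which pi_ref / advantage / mask each algorithm uses) follows the table
# in the paper's figure.

def upge(grad_pi, pi_ref, advantage, stable_mask=1.0):
    """Unified estimator: mask * (1/pi_ref) * advantage * grad pi_theta."""
    return stable_mask * (1.0 / pi_ref) * advantage * grad_pi

def sft_grad(grad_pi, pi_theta):
    """SFT instance: pi_ref = pi_theta, advantage = 1, mask = 1, which
    collapses to (1/pi_theta) * grad pi_theta, i.e. grad log pi_theta."""
    return upge(grad_pi, pi_ref=pi_theta, advantage=1.0, stable_mask=1.0)

# Check: the SFT instance equals the log-likelihood gradient d/dθ log π_θ.
pi, grad_pi = 0.2, 0.05
assert math.isclose(sft_grad(grad_pi, pi), grad_pi / pi)
print(sft_grad(grad_pi, pi))
```

The same `upge` call reproduces PPO/GRPO by swapping in π_θold for `pi_ref`, a GAE or group-normalized estimate for `advantage`, and a clipping term for `stable_mask`, which is exactly the paper's claim that these methods share one gradient form.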
Q1. What is the key innovation of the Unified Policy Gradient Estimator (UPGE) framework?
- It combines SFT and RL into a sequential pipeline
- It shows that SFT and RL are instances of the same optimization process
- It eliminates the need for supervised learning entirely

Q2. How does the Hybrid Post-Training (HPT) algorithm determine when to use SFT versus RL?
- It uses a fixed schedule alternating between SFT and RL
- It randomly switches between the two methods
- It dynamically switches based on real-time performance feedback

Q3. What unexpected finding was observed regarding HPT's Pass@k performance?
- HPT achieved lower Pass@k than both SFT and RL individually
- HPT's Pass@k fell exactly between SFT and RL's performance
- HPT achieved higher Pass@k than both SFT and RL individually