2025-04-22 Papers


Paper 1

FlowReasoner: Reinforcing Query-Level Meta-Agents

Published: 2025-04-21

Link: http://arxiv.org/pdf/2504.15257

1. 📘 Topic and Domain: The paper introduces FlowReasoner, a query-level meta-agent for automating the design of personalized multi-agent systems in the domain of AI agent systems.
2. 💡 Previous Research and New Ideas: The paper builds on previous task-level meta-agents that create fixed workflows for specific tasks, proposing instead a query-level approach that generates a unique multi-agent system for each individual user query through reasoning-based optimization.
3. ❓ Problem: The paper addresses the limitation of existing multi-agent systems that are either manually designed (requiring significant human effort) or task-level automated (creating one-size-fits-all systems that lack adaptability to individual queries).
4. 🛠️ Methods: The authors distill reasoning abilities from DeepSeek R1 to endow FlowReasoner with basic multi-agent system generation capabilities, then enhance it through reinforcement learning with external execution feedback using a multi-purpose reward focused on performance, complexity, and efficiency.
5. 📊 Results and Evaluation: FlowReasoner outperforms existing methods across engineering and competition code benchmarks, notably surpassing o1-mini by 10.52% accuracy across three benchmarks, while demonstrating superior adaptability by generating personalized workflows tailored to specific queries.
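The multi-purpose reward in step 4 can be sketched as a weighted combination of the three signals. The function name, argument names, weights, and linear penalty form below are illustrative assumptions, not values from the paper:

```python
def multi_purpose_reward(pass_rate, n_agents, exec_seconds,
                         w_perf=1.0, w_complexity=0.1, w_efficiency=0.05):
    """Illustrative scalar reward for an RL-trained meta-agent.

    pass_rate: fraction of tests the generated system passes (0..1).
    n_agents: number of agents in the generated workflow (complexity proxy).
    exec_seconds: wall-clock time to run the workflow (efficiency proxy).
    Weights and penalty forms are assumptions, not the paper's values.
    """
    return (w_perf * pass_rate
            - w_complexity * n_agents
            - w_efficiency * exec_seconds)
```

The key design point is that performance is rewarded while workflow size and runtime are penalized, so the meta-agent is pushed toward the simplest system that still solves the query.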


Methodology flowchart: training and inference pipeline for the query-level meta-agent.

Phase 1: Training FlowReasoner
1. Reasoning data distillation: a teacher LLM (DeepSeek-R1, 671B) generates multi-round reasoning and system data, plus initial feedback.
2. SFT warmup: the student LLM (DeepSeek-R1-Distill-Qwen-7B) is finetuned on the distilled data D to acquire basic reasoning ability.
3. Reinforce reasoning via RL (GRPO): starting from the SFT model and user queries q, (a) sample multiple trajectories o_i, (b) execute them in a sandbox, (c) collect external feedback as a multi-purpose reward combining performance (pass rate) with complexity and diversity terms, and (d) update the policy via GRPO, yielding the trained FlowReasoner model.

Phase 2: Inference with FlowReasoner
Given a new user query q, the trained model performs deliberative reasoning (l-round optimization), iteratively refining its design with external feedback (e.g., pass rate) until it outputs a query-specific multi-agent system S*_query. Executing S*_query on q produces the final answer a.
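The GRPO step scores each sampled trajectory relative to the group it was sampled with. A minimal sketch of that group-relative advantage (function name and the zero-variance fallback are assumptions, not from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each trajectory's reward
    against the mean and population std of its sampling group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:  # identical rewards carry no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed: trajectories that beat their siblings get positive advantage, the rest get negative.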
Q1. What is the key difference between FlowReasoner and previous task-level meta-agents?
- FlowReasoner uses more complex search algorithms
- FlowReasoner generates a personalized multi-agent system for each individual user query
- FlowReasoner requires less computational resources

Q2. How does FlowReasoner enhance its reasoning capabilities after the initial training?
- Through manual optimization by human experts
- Through Monte Carlo Tree Search (MCTS)
- Through reinforcement learning with external execution feedback

Q3. In the experimental evaluation, by what percentage did FlowReasoner outperform the o1-mini model across three benchmarks?
- 5.26%
- 10.52%
- 15.78%

Paper 2

Learning to Reason under Off-Policy Guidance

Published: 2025-04-21

Link: http://arxiv.org/pdf/2504.14945

1. 📘 Topic and Domain: The paper focuses on enhancing large language models' reasoning capabilities through reinforcement learning that integrates off-policy guidance.
2. 💡 Previous Research and New Ideas: The paper builds on zero-RL approaches that train reasoning models using only on-policy rollouts and rule-based rewards, and proposes LUFFY, a framework that incorporates off-policy reasoning traces from stronger models to expand learning beyond the model's initial capabilities.
3. ❓ Problem: The paper addresses the limitation of existing zero-RL methods which constrain learning to a model's own outputs, preventing acquisition of reasoning abilities beyond its initial capabilities.
4. 🛠️ Methods: The authors use a mixed-policy approach that combines off-policy demonstrations with on-policy rollouts during training, employing policy shaping via regularized importance sampling to emphasize low-probability but crucial actions.
5. 📊 Results and Evaluation: LUFFY achieves an average gain of over +7.0 points across six math benchmarks and +6.2 points on out-of-distribution tasks, outperforming both imitation-based supervised fine-tuning and existing zero-RL methods in both performance and generalization.
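The policy-shaping transform f(x) = x/(x + γ) from the methods can be written directly; the γ = 0.1 default below is an assumed placeholder, not necessarily the paper's value:

```python
def shape_off_policy_ratio(ratio, gamma=0.1):
    """LUFFY-style policy shaping f(r) = r / (r + gamma).

    Applied to the off-policy importance ratio r = pi_theta / pi_phi,
    it keeps the weight of low-probability tokens from vanishing:
    f has its steepest slope near r = 0, so rare-but-crucial actions
    in the expert traces still receive a strong learning signal.
    gamma here is an assumed placeholder value.
    """
    return ratio / (ratio + gamma)
```

Note that f(γ) = 0.5 and f saturates toward 1 for large r, which caps the influence of tokens the student already predicts confidently while amplifying the gradient on tokens it barely assigns probability to.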


Method flowchart:

1. Start from a base policy model π_θ_old (e.g., Qwen2.5-Math) and collect off-policy traces τ_j ~ π_φ from a stronger model (e.g., DeepSeek-R1).
2. Generate on-policy rollouts τ_i ~ π_θ_old and combine them with the off-policy traces.
3. Compute a mixed advantage Â over the combined set from the trajectory rewards R(τ).
4. Optimize the LUFFY objective:
   - On-policy signal: importance ratio r_i,t = π_θ / π_θ_old; objective term r_i,t · Â, with clipping removed to allow larger updates.
   - Off-policy signal: base importance ratio r̂_j,t = π_θ / π_φ, reshaped by policy shaping f(x) = x / (x + γ), which boosts tokens with low π_θ; objective term f(r̂_j,t) · Â.
5. Update π_θ, yielding the trained LUFFY model.
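The mixed advantage over the union of on- and off-policy samples can be sketched as below; the simple mean-reward baseline is an assumption, and the paper's exact normalization may differ:

```python
def mixed_advantages(on_rewards, off_rewards):
    """Advantage of every trajectory, on- and off-policy alike,
    measured against the mean reward of the combined set."""
    combined = list(on_rewards) + list(off_rewards)
    baseline = sum(combined) / len(combined)
    return ([r - baseline for r in on_rewards],
            [r - baseline for r in off_rewards])
```

Sharing one baseline across both sample sources is what makes the expert traces informative: when they earn higher reward than the student's own rollouts, their tokens receive positive advantage and pull the policy toward them.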
Q1. What is the primary limitation of existing zero-RL methods that LUFFY aims to overcome?
- High computational cost and training instability
- Inability to learn reasoning abilities beyond the model's initial capabilities
- Poor performance on simple mathematical problems

Q2. How does LUFFY's policy shaping mechanism enhance learning from off-policy traces?
- By eliminating all low-probability actions from the model's policy
- By assigning more importance to high-probability actions only
- By amplifying learning signals for low-probability but crucial actions

Q3. What advantage did LUFFY demonstrate over supervised fine-tuning (SFT) in the experimental results?
- Superior generalization capability, especially on out-of-distribution tasks
- Significantly faster training times with less computational resources
- Ability to completely eliminate hallucinations in mathematical reasoning

Paper 3

StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians

Published: 2025-04-21

Link: http://arxiv.org/pdf/2504.15281

1. 📘 Topic and Domain: The paper presents StyleMe3D, a framework for transferring artistic styles to 3D Gaussian Splatting representations while preserving geometric integrity.
2. 💡 Previous Research and New Ideas: The paper builds upon 3D Gaussian Splatting and existing style transfer techniques, proposing a novel approach that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement.
3. ❓ Problem: The paper addresses the challenge of stylizing 3D Gaussian Splatting scenes with artistic styles while maintaining geometric details, semantic coherence, and visual harmony.
4. 🛠️ Methods: The authors use four key components: Dynamic Style Score Distillation (DSSD) for semantic alignment, Contrastive Style Descriptor (CSD) for content-aware textures, Simultaneously Optimized Scale (SOS) for detail preservation, and 3D Gaussian Quality Assessment (3DG-QA) for aesthetic quality.
5. 📊 Results and Evaluation: StyleMe3D outperforms state-of-the-art methods in preserving geometric details and ensuring stylistic consistency across scenes, achieving better PSNR, SSIM, and LPIPS scores (for LPIPS, lower is better) while maintaining real-time rendering capabilities.
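The four losses from the methods are combined into a single training objective; a sketch with placeholder λ weights (the actual weights are not given here and are assumptions):

```python
def styleme3d_total_loss(l_style, l_sos, l_csd, l_3dg_qa,
                         lambdas=(1.0, 0.5, 0.5, 0.1)):
    """Weighted sum L = λ1·L_style + λ2·L_SOS + λ3·L_CSD + λ4·L_3DG-QA.
    The lambda values here are illustrative placeholders."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_style + l2 * l_sos + l3 * l_csd + l4 * l_3dg_qa
```

Each λ trades off one level of the stylization: semantics (DSSD), texture (SOS), style fidelity (CSD), and global aesthetics (3DG-QA).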


Workflow flowchart: stylizing 3D Gaussians.

1. Inputs: a pre-trained 3D GS scene with fixed geometry Θ_geo, and a style reference (image or text prompt).
2. Style purification (using CLIP): isolate style embeddings and remove content, producing a purified style embedding.
3. Main optimization loop (only the colors Θ_color are optimized): render an image I_v from the current 3D GS and score it with four losses:
   - DSSD, Dynamic Style Score Distillation (Stable Diffusion prior): high-level semantics, with dynamic CFG and timesteps, including style outpainting (PSO); loss L_style.
   - SOS, Simultaneously Optimized Scale (VGG prior): low-level texture details via multi-scale Gram matrices; loss L_SOS.
   - CSD, Contrastive Style Descriptor (ViT prior): mid-level style fidelity via cosine similarity on style features; loss L_CSD.
   - 3DG-QA, 3D Gaussian Quality Assessment (CLIP-IQA prior): global aesthetic quality via antonym prompts and artifact removal; loss L_3DG-QA.
4. Combine the losses, L_final = λ1·L_style + λ2·L_SOS + λ3·L_CSD + λ4·L_3DG-QA, compute the gradients ∇Θ_color, and update the colors; iterate until convergence.
5. Output: a stylized 3D Gaussian Splatting scene with optimized Θ_color.
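The main loop optimizes only the color parameters while geometry stays fixed. A minimal gradient-descent sketch of that color-only update; the interface (a flat parameter list and a user-supplied gradient callable), learning rate, and step count are all assumptions:

```python
def stylize_colors(theta_color, color_grad, lr=0.01, steps=100):
    """Gradient descent on Theta_color only; Theta_geo is never touched.

    theta_color: flat list of color parameters.
    color_grad: callable returning dL_final/dTheta_color for the
                current colors (render + four losses, assumed given).
    """
    for _ in range(steps):
        g = color_grad(theta_color)
        theta_color = [c - lr * gc for c, gc in zip(theta_color, g)]
    return theta_color
```

With a quadratic toy loss L = Σ c² (gradient 2c), the colors decay geometrically toward the minimum, which is the same fixed-geometry update pattern the flowchart describes for the real rendered losses.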
Q1. What is the primary innovation of StyleMe3D compared to previous 3D stylization approaches?
- Using only VGG-based feature extraction for style transfer
- Integration of Stable Diffusion into 3D Gaussian Splatting optimization
- Complete modification of geometry during stylization

Q2. Which component of StyleMe3D is specifically designed to extract medium-level style descriptors for content-aware stylization?
- Dynamic Style Score Distillation (DSSD)
- Contrastive Style Descriptor (CSD)
- 3D Gaussian Quality Assessment (3DG-QA)

Q3. During the stylization process in StyleMe3D, which parameters of the 3D Gaussian Splatting representation are optimized?
- Only the geometric parameters
- Only the color parameters
- Both geometric and color parameters