2025-11-27 Papers


Paper 1

ROOT: Robust Orthogonalized Optimizer for Neural Network Training

Published: 2025-11-25

Link: http://arxiv.org/pdf/2511.20626

1. 📘 Topic and Domain: Development of a robust optimization algorithm (ROOT) for training large language models, focusing on improving stability and efficiency in deep learning optimization.
2. 💡 Previous Research and New Ideas: Builds on the Muon optimizer and Newton-Schulz iteration methods, proposing dimension-adaptive coefficients for matrix orthogonalization together with an outlier-suppression mechanism.
3. ❓ Problem: Addressing two key limitations in existing optimizers: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise during training.
4. 🛠️ Methods: Implements adaptive Newton iteration with dimension-specific coefficients for robust orthogonalization, and soft-thresholding for outlier suppression in gradient updates.
5. 📊 Results and Evaluation: Achieved superior performance across academic benchmarks compared to Muon and AdamW baselines, with improved convergence speed and training stability, demonstrating an average accuracy of 60.12% across various tasks.
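The soft-thresholding step named in the methods can be sketched in a few lines of NumPy. The threshold ε = 1.0 below is purely illustrative; the paper selects the threshold from a percentile of the update statistics:

```python
import numpy as np

def soft_threshold(x, eps):
    """Elementwise soft-thresholding: T_eps[x]_i = sign(x_i) * max(|x_i| - eps, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

# Outlier separation in the style of ROOT: O_t = T_eps[M_t], B_t = M_t - O_t
M = np.array([0.1, -0.2, 3.0, -4.0])   # momentum with two large outlier entries
eps = 1.0                              # illustrative threshold
O = soft_threshold(M, eps)             # outlier part: only the magnitude beyond eps
B = M - O                              # bounded part: every entry clipped to [-eps, eps]
```

Only the bounded component B would then be orthogonalized, which keeps extreme entries from dominating the update.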

ROOT workflow overview

Problem identification
• Dimensional fragility in orthogonalization precision
• Vulnerability to outlier-induced noise

Adaptive Newton iteration with dimension-specific coefficients:
X_k = a(m,n) X_{k-1} + b(m,n) X_{k-1} (X_{k-1}^T X_{k-1}) + c(m,n) X_{k-1} (X_{k-1}^T X_{k-1})^2

Proximal optimization: soft-thresholding for outlier suppression,
T_ε[x]_i = sign(x_i) · max(|x_i| − ε, 0), with momentum decomposition M_t = B_t + O_t

ROOT algorithm
1. Compute gradient G_t
2. Update momentum: M_t = μ M_{t-1} + G_t
3. Separate outliers: O_t = T_ε[M_t], B_t = M_t − O_t
4. Robust orthogonalization: B_t^orth = AdaNewton(B_t)
5. Update parameters: θ_t = θ_{t-1} − η B_t^orth

Experimental setup and evaluation
• FineWeb-Edu dataset; 1B-parameter Transformer; 10B/100B training tokens
• Zero-shot benchmarks: HellaSwag, ARC, BoolQ, PIQA, SciQ, WINO
• Ablations: threshold sensitivity, coefficient calibration, vision-task generalization

Key results and contributions
• Theoretical guarantee E^(m,n) ≤ E^std: provably better orthogonalization
• 60.12% average accuracy vs. 59.59% (Muon) and 59.05% (AdamW)
• Superior convergence speed and robustness in noisy scenarios
• Contributions: algorithmic robustness, optimization robustness, a unified framework, extensive validation
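As a sanity check, the update loop can be sketched in NumPy. The fixed Newton-Schulz coefficients below are the widely used Muon defaults, serving only as stand-ins: ROOT's contribution is precisely that a, b, c are calibrated per matrix shape (m, n), which is not reproduced here, and the learning rate, momentum, and threshold are illustrative.

```python
import numpy as np

def adaptive_newton_orth(G, coeffs=(3.4445, -4.7750, 2.0315), steps=5):
    """Newton-Schulz-style orthogonalization: X_k = a*X + b*X(X^T X) + c*X(X^T X)^2.
    The fixed (a, b, c) are the Muon defaults; ROOT instead calibrates them
    per matrix shape (m, n)."""
    a, b, c = coeffs
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so the iteration converges
    for _ in range(steps):
        A = X.T @ X
        X = a * X + X @ (b * A + c * (A @ A))
    return X

def root_step(theta, grad, M, mu=0.95, lr=0.02, eps=1.0):
    """One ROOT-style update (sketch; hyperparameters are illustrative)."""
    M = mu * M + grad                                   # 2. momentum
    O = np.sign(M) * np.maximum(np.abs(M) - eps, 0.0)   # 3. outliers O_t = T_eps[M_t]
    B = M - O                                           #    bounded part B_t
    return theta - lr * adaptive_newton_orth(B), M      # 4.-5. orthogonalize, step
```

After a few iterations the singular values of the orthogonalized matrix cluster near 1, which is what makes the update scale-robust.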
Q1. What are the two main robustness limitations that ROOT aims to address?
• Memory efficiency and computational speed
• Dimensional fragility in orthogonalization and vulnerability to outlier noise
• Model convergence and gradient vanishing
Q2. In the ROOT optimizer's soft-thresholding mechanism, what percentile threshold value was identified as optimal for LLM pre-training?
• p = 0.85
• p = 0.90
• p = 0.99
Q3. When testing ROOT on vision tasks using CIFAR-10, what unexpected finding was discovered?
• The optimizer failed completely on vision tasks
• It performed best with a higher quantile threshold of 0.95
• It achieved the highest accuracy (88.44%) with a lower quantile threshold of 0.85

Paper 2

Latent Collaboration in Multi-Agent Systems

Published: 2025-11-25

Link: http://arxiv.org/pdf/2511.20639

1. 📘 Topic and Domain: The paper focuses on enabling direct latent space collaboration between large language models in multi-agent systems, within the domain of natural language processing and artificial intelligence.
2. 💡 Previous Research and New Ideas: Based on previous research on text-based multi-agent LLM systems and single-model latent reasoning, this paper proposes a novel framework called LatentMAS that enables pure latent collaboration among multiple LLM agents without requiring text-based mediation.
3. ❓ Problem: The paper aims to overcome the inefficiencies and information bottlenecks of text-based collaboration between LLM agents by enabling them to collaborate directly in continuous latent space rather than through natural language.
4. 🛠️ Methods: The paper introduces LatentMAS, an end-to-end training-free framework that combines auto-regressive latent thoughts generation through last-layer hidden embeddings and cross-agent latent working memory transfer through shared KV caches.
5. 📊 Results and Evaluation: Across 9 benchmarks spanning math, science, commonsense reasoning and code generation, LatentMAS achieved up to 14.6% higher accuracy, reduced output token usage by 70.8%-83.7%, and provided 4×-4.3× faster end-to-end inference compared to baselines.
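The efficiency claim can be illustrated with a back-of-the-envelope calculation comparing the information in one sampled token against the raw capacity of one hidden-state vector. The hidden size, vocabulary size, and precision below are typical for a mid-sized LLM, not values from the paper:

```python
import math

# Illustrative model dimensions (assumptions, not from the paper)
d_h = 4096           # hidden-state width: floats carried by one latent thought
vocab = 128_000      # vocabulary size
bits_per_float = 16  # bf16 activations

bits_per_token = math.log2(vocab)       # information in one sampled token id
bits_per_latent = d_h * bits_per_float  # raw capacity of one hidden state
ratio = bits_per_latent / bits_per_token

print(f"one latent step can carry roughly {ratio:.0f}x more bits than one token")
```

This is only a capacity bound, but it gives intuition for why latent collaboration can cut token usage so sharply.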

LatentMAS workflow overview

Agent 1 – latent reasoning
• Auto-regressively generates m latent thoughts from last-layer hidden states
• Input-output alignment via W_a

Latent working memory
• KV caches extracted from all layers, preserving the input and latent thoughts
• Layer-wise concatenation enables lossless information transfer

Agent 2 (and beyond) – latent reasoning
• Inherits the working memory and conditions on the previous agent
• Generates new latent thoughts and updates the KV caches

Collaborative frameworks
• Sequential MAS: Planner → Critic → Refiner → Solver
• Hierarchical MAS: Math/Science/Code agents → Summarizer

Theoretical foundations
• Reasoning expressiveness: O(d_h/log|V|) more efficient than text
• Communication fidelity: lossless transfer, with lower complexity than text-based MAS

Performance (9 benchmarks: math, science, commonsense reasoning, code)
• Up to 14.6% higher accuracy; 70.8%–83.7% fewer output tokens; 4×–4.3× faster end-to-end inference
• Training-free; only the last agent decodes to text; semantic consistency verified

Key innovation: agents communicate through continuous latent space instead of discrete text tokens, enabling richer and more efficient multi-agent reasoning.
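A toy illustration of the cross-agent working-memory transfer: each agent appends its raw latent states to a shared per-layer cache, and the next agent conditions on everything accumulated so far. Every class and projection here is a hypothetical stand-in, not the paper's implementation, which operates on a real transformer's KV tensors:

```python
import numpy as np

class LatentAgent:
    """Toy stand-in for an LLM agent that 'thinks' in latent vectors."""
    def __init__(self, n_layers, d_h, seed):
        rng = np.random.default_rng(seed)
        # Hypothetical per-layer projections playing the role of a frozen LLM.
        self.W = [rng.standard_normal((d_h, d_h)) / np.sqrt(d_h)
                  for _ in range(n_layers)]

    def think(self, kv_cache, h, m_steps):
        """Append m latent steps; each step conditions on the shared cache."""
        for _ in range(m_steps):
            for layer, W in enumerate(self.W):
                # Attend (crudely, via a mean) over what earlier agents left behind.
                ctx = np.mean(kv_cache[layer], axis=0) if kv_cache[layer] else 0.0
                h = np.tanh(W @ (h + ctx))
                kv_cache[layer].append(h)  # raw latent state: nothing decoded to text
        return h

n_layers, d_h = 2, 8
cache = {layer: [] for layer in range(n_layers)}  # shared latent working memory
h = np.ones(d_h)                                  # stands in for the encoded question
h = LatentAgent(n_layers, d_h, seed=0).think(cache, h, m_steps=3)  # e.g. planner
h = LatentAgent(n_layers, d_h, seed=1).think(cache, h, m_steps=3)  # e.g. solver
# Only after the final agent would the system decode latent state back to text.
```

The design point this mimics: because the cache stores raw hidden states rather than sampled tokens, nothing is lost to the decode-then-re-encode round trip of text-based MAS.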
Q1. What is the main innovation of LatentMAS compared to previous multi-agent LLM systems?
• It enables agents to collaborate through text-based communication
• It allows agents to collaborate directly in continuous latent space
• It requires extensive training of the agents before deployment
Q2. According to the paper's results, what percentage of token-usage reduction was achieved by LatentMAS?
• 30–40% reduction
• 50–60% reduction
• 70.8–83.7% reduction
Q3. How does LatentMAS achieve lossless information transfer between agents?
• By converting all information into text format
• By using shared latent working memory stored in KV caches
• By training specialized neural networks for information transfer

Paper 3

Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Published: 2025-11-26

Link: http://arxiv.org/pdf/2511.21579

1. 📘 Topic and Domain: Audio-visual content generation using diffusion models, focusing on synchronizing audio and video generation for both speech and environmental sounds.
2. 💡 Previous Research and New Ideas: Based on previous joint audio-video generation models that struggled with synchronization; proposes new cross-task synergy training and enhanced audio-visual alignment mechanisms.
3. ❓ Problem: Poor audio-video synchronization in existing open-source models due to "Correspondence Drift" during joint training and inefficient attention mechanisms.
4. 🛠️ Methods: Introduces three key innovations: Cross-Task Synergy training combining joint and single-modality generation, Global-Local Decoupled Interaction Module for temporal alignment, and Synchronization-Enhanced CFG for better audio-visual correspondence.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance on their new Harmony-Bench dataset, significantly outperforming existing methods in audio-visual synchronization while maintaining high generation quality for both speech and environmental sounds.

Harmony workflow overview

Inputs: video and audio clips, speech transcripts, reference audio

Dual-branch architecture
• Video branch: pre-trained Wan2.2-5B
• Audio branch: MM-DiT with multiple encoders

Cross-Task Synergy training
• Joint generation: audio + video
• Audio-driven video: clean audio → video
• Video-driven audio: clean video → audio

Global-Local Decoupled Interaction Module
• RoPE-aligned frame-wise attention for precise temporal synchronization
• Global style alignment to resolve temporal misalignment

Synchronization-Enhanced CFG (inference)
• Mute-audio and static-video negative anchors
• Amplifies the audio-visual alignment signal during sampling

Output: high-fidelity video with clear audio synthesis and precise temporal alignment, resolving Correspondence Drift and achieving state-of-the-art audio-visual synchronization through cross-modal learning.
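The Synchronization-Enhanced CFG idea can be sketched as standard classifier-free guidance extended with two extra negative anchors. The additive combination rule and the weights below are illustrative assumptions; Harmony's exact formulation may differ:

```python
import numpy as np

def sync_enhanced_cfg(eps_cond, eps_uncond, eps_mute_audio, eps_static_video,
                      w_cfg=5.0, w_sync=2.0):
    """Classifier-free guidance with desynchronization negative anchors.
    Standard CFG pushes the denoiser output toward the condition; the two
    extra terms additionally push it away from silent-audio and frozen-video
    modes. Weights and the additive rule here are illustrative."""
    guided = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    guided += w_sync * (eps_cond - eps_mute_audio)    # repel the mute-audio anchor
    guided += w_sync * (eps_cond - eps_static_video)  # repel the static-video anchor
    return guided
```

Under this scheme the model is evaluated once per anchor at each diffusion step (conditional, unconditional, mute audio, static video) and the outputs are combined as above, amplifying the alignment signal at inference time.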
Q1. What is the main challenge called that causes poor audio-video synchronization during joint training?
• Temporal Misalignment
• Correspondence Drift
• Modal Desynchronization
Q2. How many test cases does the Harmony-Bench dataset contain for evaluating the model's performance?
• 50 test cases
• 100 test cases
• 150 test cases
Q3. Which component of Harmony's architecture helps achieve both holistic style consistency and precise temporal alignment?
• Cross-Task Synergy training
• Global-Local Decoupled Interaction Module
• Synchronization-Enhanced CFG