2026-03-27 Papers


Paper 1

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Published: 2026-03-26

Link: http://arxiv.org/pdf/2603.25040

1. 📘 Topic and Domain: The paper presents Intern-S1-Pro, a one-trillion-parameter scientific multimodal foundation model designed for AI for Science (AI4S), covering chemistry, materials, life sciences, earth sciences, and general reasoning.
2. 💡 Previous Research and New Ideas: Based on prior work in LLMs, VLMs, MoE architectures, and scientific AI, the paper introduces expert expansion with grouped routing to balance expert load in large MoE models, a Straight-Through Estimator for router optimization, Fourier Position Encoding (FoPE) for physical signals, a dedicated time-series encoder with adaptive subsampling, and a specialized scientific caption pipeline for high-quality image-text alignment.
3. ❓ Problem: The paper aims to solve the challenge of scaling a scientific multimodal foundation model to one trillion parameters while maintaining training stability and efficiency, addressing expert load imbalance in MoE models and improving scientific visual understanding across over 100 specialized tasks.
4. 🛠️ Methods: The methods include expert expansion from Intern-S1 using grouped routing for absolute load balancing, Straight-Through Estimator for router gradient estimation, a Native ViT encoder for vision, FoPE for positional encoding, a dynamic time-series encoder with adaptive subsampling, a scientific caption pipeline (MinerU + CapRL/InternVL3.5), and stable mixed-precision RL training (FP8 with BF16/FP32 precision handling) using XTuner and LMDeploy infrastructure.
5. 📊 Results and Evaluation: Evaluated on scientific benchmarks (SciReasoner, SFE, SmolInstruct, MatBench, Mol-Instructions, etc.) and general benchmarks (MMMU-Pro, MMLU-Pro, AIME-2025, GAIA, etc.), Intern-S1-Pro outperforms proprietary models like Gemini-3-Pro and GPT-5.2 on scientific tasks (e.g., SciReasoner 55.5 vs. 14.7 for Gemini-3-Pro) and achieves top-tier open-source performance, with strong time-series understanding on SciTS benchmarks.
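The grouped-routing and Straight-Through-Estimator ideas in items 2 and 4 can be illustrated with a toy sketch. All shapes, group counts, and function names below are illustrative assumptions, not the paper's implementation, and the STE itself (which needs an autograd framework) is only noted in a comment:

```python
import numpy as np

def grouped_topk_routing(logits, n_groups, k_per_group):
    """Grouped routing, sketched: experts are split into equal groups
    (e.g. one group per device) and the router picks the top-k experts
    *within each group*.  Every group therefore receives exactly
    n_tokens * k_per_group assignments -- absolute load balance by
    construction.  (A Straight-Through Estimator would keep this hard
    top-k in the forward pass while letting gradients flow through the
    soft router scores in the backward pass.)"""
    n_tokens, n_experts = logits.shape
    group_size = n_experts // n_groups
    grouped = logits.reshape(n_tokens, n_groups, group_size)

    # hard top-k inside each group (local ids -> global expert ids)
    local = np.argsort(-grouped, axis=-1)[..., :k_per_group]
    expert_ids = local + np.arange(n_groups)[None, :, None] * group_size
    expert_ids = expert_ids.reshape(n_tokens, -1)

    # softmax gates over the selected experts only
    picked = np.take_along_axis(logits, expert_ids, axis=1)
    gates = np.exp(picked - picked.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)
    return expert_ids, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 16))        # 128 tokens, 16 experts
ids, gates = grouped_topk_routing(logits, n_groups=4, k_per_group=2)

# each of the 4 groups (group_size = 4) gets exactly 128 * 2 assignments
per_group = np.bincount(ids.ravel() // 4, minlength=4)
```

Because the top-k is taken per group rather than globally, no group can be starved or overloaded, which is exactly the device-level balance the paper targets.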


Intern-S1-Pro workflow architecture (recovered figure text)

Architecture
- Expert expansion: Intern-S1 → Intern-S1-Pro (1T params)
- Grouped routing mechanism: load balancing, training stability
- Straight-Through Estimator: dense gradient for router optimization
- Native Vision Transformer (ViT): native resolution, 300M image-text pairs
- Fourier Position Encoding (FoPE): wave-particle duality modeling
- Time-series encoder: adaptive subsampling

Pre-training (6T tokens)
- Scientific caption pipeline: PDF extraction (MinerU 2.5), content deduplication (pHash), captioning (CapRL + InternVL3.5), quality filtering (0.5B model), topic clustering → 270B scientific tokens
- Data conflict resolution: structured scientific data transform, scientific data diversification, system prompt isolation → stable multimodal training
- General + scientific data fusion: image-text pairs and text data, 6T tokens total

Post-training (RL at 1T scale)
- Stable mixed-precision RL: operator-level precision alignment, rollout router replay for consistency, FP8 for expert MLPs with BF16 elsewhere, FP32 LM head for numerical fidelity, importance sampling with masking → train-inference consistency
- Infrastructure co-design: XTuner + LMDeploy, 4× scale at ~20% efficiency loss
- Agentic RL training: multi-step planning and execution

Capabilities and evaluation
- Scientific tasks (100+ domains): SciReasoner 55.5 (scientific reasoning), SmolInstruct 74.8 (chemistry), MatBench 72.8 (materials science), Mol-Instructions 48.8 (biomolecular), MicroVQA 63.3 (microscopy), MSEarth 65.2 (earth science), SciTS time series (EAU01: 99.5 F1), Biology-Instruction 52.5 (multi-omics)
- General capabilities: MMMU-Pro 72.8 (knowledge and reasoning), MMLU-Pro 86.6 (multi-task language), AIME-2025 93.1 (math olympiad), RefCOCO 91.9 (visual grounding), IFBench 71.2 (instruction following), OCRBench V2 60.1 (OCR), SArena 83.5 (SVG generation), LCB V6 74.3 (code generation)
- Agent capabilities: GAIA 77.4 (real-world task solving), Tau2-Bench 80.9 (conversational AI), ScreenSpot V2 93.6 (GUI grounding); multi-step planning, autonomous scientific workflows, tool use and reasoning, environmental grounding, web search integration
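The adaptive subsampling used by the time-series encoder can be sketched as follows. This is a hypothetical minimal version; `max_len` and the strided policy are illustrative assumptions, not the paper's encoder:

```python
import numpy as np

def adaptive_subsample(series, max_len=512):
    """Adaptive subsampling sketch: signals longer than the token
    budget are strided down to fit, short signals pass through
    unchanged.  The stride adapts to the input length, so very
    different sampling rates are absorbed before encoding.
    `max_len` is an illustrative budget, not a value from the paper."""
    n = len(series)
    if n <= max_len:
        return series
    stride = -(-n // max_len)          # ceil(n / max_len)
    return series[::stride]

short = adaptive_subsample(np.arange(100))     # fits the budget, untouched
long_ = adaptive_subsample(np.arange(10_000))  # strided down to <= 512 steps
```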
Q1
1. What is the primary challenge that the Grouped Routing mechanism in Intern-S1-Pro addresses?
Reducing the total number of parameters in the model
Achieving absolute load balancing across devices in expert parallelism training
Improving the vision encoder's accuracy on natural images
Q2
2. Why was a specialized caption pipeline developed for scientific images instead of using existing web caption datasets?
Because web caption datasets were too large and caused memory issues
Because existing captions often have limited image-text alignment and lack the detail required for scientific visual content
Because the model needed captions in Chinese instead of English
Q3
3. On the SciReasoner benchmark, how did Intern-S1-Pro compare to Gemini-3-Pro?
Intern-S1-Pro scored slightly lower at 50.2 vs Gemini-3-Pro's 52.1
Intern-S1-Pro significantly outperformed Gemini-3-Pro with 55.5 vs 14.7
Both models achieved identical scores of 40.0

Paper 2

Voxtral TTS

Published: 2026-03-26

Link: http://arxiv.org/pdf/2603.25551

1. 📘 Topic and Domain: The paper focuses on expressive multilingual text-to-speech (TTS) synthesis, specifically zero-shot voice cloning and natural speech generation.
2. 💡 Previous Research and New Ideas: Based on prior zero-shot TTS systems using discrete speech tokens and diffusion/flow-based acoustic modeling, the paper proposes a hybrid architecture combining auto-regressive semantic token generation with flow-matching for acoustic tokens, and introduces the Voxtral Codec with ASR-distilled semantic VQ and FSQ acoustic quantization.
3. ❓ Problem: The paper aims to solve the challenge of generating natural, expressive, and speaker-similar speech from very short reference audio (as little as 3 seconds) in a multilingual zero-shot setting, moving beyond autoregressive acoustic modeling.
4. 🛠️ Methods: The method uses a hybrid architecture: an auto-regressive decoder (Ministral 3B) for semantic tokens, a flow-matching transformer for acoustic tokens, and the Voxtral Codec for tokenization; training involves pretraining on pseudo-labeled audio-text pairs and Direct Preference Optimization (DPO) adapted for the hybrid discrete-continuous setting.
5. 📊 Results and Evaluation: Results show Voxtral TTS achieves a 68.4% win rate over ElevenLabs Flash v2.5 in human evaluations for voice cloning, with strong speaker similarity and intelligibility (WER) across 9 languages, and efficient inference via CUDA graph acceleration and asynchronous chunked streaming in vLLM-Omni.
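The flow-matching half of the hybrid objective can be sketched in its generic conditional form. This is the textbook linear-interpolation formulation, not necessarily Voxtral's exact parameterization; `velocity_fn` is a stand-in for the flow-matching transformer:

```python
import numpy as np

def flow_matching_loss(x0, x1, velocity_fn, rng):
    """Generic conditional flow-matching objective: sample a point on
    the straight line between noise x0 and data x1, then regress the
    model's velocity field onto the constant target velocity x1 - x0."""
    t = rng.uniform(size=(x0.shape[0], 1))   # one time per sample
    x_t = (1.0 - t) * x0 + t * x1            # linear interpolation path
    target_v = x1 - x0                       # ground-truth velocity
    pred_v = velocity_fn(x_t, t)
    return float(np.mean((pred_v - target_v) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 8))                # noise samples
x1 = rng.normal(size=(64, 8))                # stand-in "acoustic" targets

# an oracle that already outputs the true velocity drives the loss to zero
oracle = lambda x_t, t: x1 - x0
loss = flow_matching_loss(x0, x1, oracle, rng)
```

At inference, the learned velocity field is integrated over a few steps (the figure text mentions 8 NFEs with classifier-free guidance) rather than sampled autoregressively, which is what lets the acoustic stage stay fast.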


Voxtral TTS workflow architecture (recovered figure text)

Inputs
- Voice reference: 3-30 seconds, 24 kHz mono
- Text prompt: tokenized input

Voxtral Codec encoder
- Patchification → transformer blocks → causal CNNs → quantization
- VQ for semantic tokens (256-dim), FSQ for acoustic tokens (36×21), 2.14 kbps total
- Audio tokens: 37 per frame at 12.5 Hz (1 semantic + 36 acoustic)

Generation pipeline
- Decoder backbone (Ministral 3B): auto-regressive semantic token generation
- Flow-matching transformer: continuous space, 8 NFEs + CFG
- Hidden state h → linear head → semantic + acoustic tokens
- Voxtral Codec decoder: transposed CNNs + transformer blocks → 24 kHz waveform

Training pipeline
- Pretraining: paired audio and transcripts, (A₁, T₂, A₂) tuples, cross-entropy loss + flow-matching loss
- DPO fine-tuning: rejection sampling, semantic DPO (β=0.1), acoustic DPO (β=0.5), combined with the pretraining loss

vLLM-Omni serving
- CUDA graph acceleration: 47% latency reduction
- Asynchronous chunked streaming: 70 ms latency, 0.103 RTF

Key innovations: hybrid AR semantic generation + flow-matching acoustic generation, hybrid VQ-FSQ quantization, ASR-distilled semantic tokens, DPO preference alignment, CUDA graph optimization; 9 languages supported; 68.4% win rate vs ElevenLabs Flash v2.5.
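The DPO fine-tuning step (semantic β=0.1, acoustic β=0.5) builds on the standard DPO objective. A minimal sketch of that generic loss, without the paper's discrete/continuous adaptation:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO objective in its generic form.  logp_* are sequence
    log-probabilities of the chosen (w) and rejected (l) samples under
    the policy and a frozen reference model; beta scales the implicit
    reward margin."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))  # -log sigmoid(margin)

# policy prefers the chosen sample more than the reference does -> loss < log 2
loss_better = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
# no preference shift relative to the reference -> loss is exactly log 2
loss_neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0, beta=0.1)
```

The two β values in the figure suggest the acoustic head is pushed harder toward the preferred samples than the semantic one, though the exact adaptation to the flow-matched continuous tokens is specific to the paper.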
Q1
1. What quantization scheme does Voxtral Codec use for its semantic and acoustic tokens?
VQ for semantic tokens and FSQ for acoustic tokens
RVQ for semantic tokens and VQ for acoustic tokens
FSQ for both semantic and acoustic tokens
Q2
2. What was Voxtral TTS's win rate over ElevenLabs Flash v2.5 in human evaluations for multilingual zero-shot voice cloning?
58.3%
72.4%
68.4%
Q3
3. Which component did the authors use as the decoder backbone for Voxtral TTS?
Mistral Small
Ministral 3B
Whisper encoder

Paper 3

DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Published: 2026-03-24

Link: http://arxiv.org/pdf/2603.23499

1. 📘 Topic and Domain: The paper focuses on optical flow estimation in computer vision, specifically addressing the challenge of accurately estimating motion in real-world videos suffering from degradations like blur, noise, and compression artifacts.
2. 💡 Previous Research and New Ideas: The paper builds upon RAFT-based optical flow methods and pretrained image restoration diffusion models (e.g., DiT4SR); it introduces the novel task of "Degradation-Aware Optical Flow" and proposes lifting an image restoration diffusion model to video via full spatio-temporal attention to create a hybrid architecture that fuses diffusion and CNN features.
3. ❓ Problem: It solves the problem of severe performance degradation experienced by existing optical flow models when applied to corrupted real-world inputs, where texture and motion boundaries are obscured.
4. 🛠️ Methods: The authors lift a DiT-based image restoration model to video using full spatio-temporal cross-frame attention, select the best layers for correspondence, upsample these features using DPT heads, and fuse them with conventional CNN features in a RAFT-based iterative refinement framework, training with pseudo ground-truth flow from high-quality videos.
5. 📊 Results and Evaluation: DA-Flow substantially outperforms state-of-the-art methods (RAFT, SEA-RAFT, FlowSeek) on Sintel, Spring, and TartanAir benchmarks under synthetic degradations, demonstrating lower End-Point Error (EPE) and fewer outlier pixels, validated through quantitative metrics and qualitative visual comparisons.
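The correlation operator at the heart of the RAFT-style refinement computes an all-pairs similarity volume between the two frames' feature maps. A generic sketch (not DA-Flow's implementation; shapes and the shift test are illustrative):

```python
import numpy as np

def all_pairs_correlation(f1, f2):
    """All-pairs correlation volume as used in RAFT-style estimators.
    f1, f2: feature maps of shape (H, W, D); the result C[i, j, k, l]
    is the dot product between the feature at (i, j) in frame 1 and
    (k, l) in frame 2, scaled by 1/sqrt(D)."""
    H, W, D = f1.shape
    return np.einsum("ijd,kld->ijkl", f1, f2) / np.sqrt(D)

rng = np.random.default_rng(0)
f1 = rng.normal(size=(8, 8, 64))
f2 = np.roll(f1, shift=2, axis=1)   # frame 2 = frame 1 shifted 2 px right

corr = all_pairs_correlation(f1, f2)
# the best match for pixel (4, 3) should sit at its shifted location (4, 5)
best = np.unravel_index(np.argmax(corr[4, 3]), (8, 8))
```

Degradations blur exactly the feature distinctiveness this volume relies on, which is why the paper fuses degradation-robust diffusion features into the maps before correlating.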


DA-Flow method workflow (recovered figure text)

Pipeline
- Input: low-quality frame pair I_LQ^k, I_LQ^{k+1}
- Lifted image restoration diffusion model (D_φ): DiT4SR backbone, full spatio-temporal cross-frame attention in MM-DiT layers
- Feature extraction: query/key (Q, K) features → DPT-based upsampling; context features
- CNN encoder (RAFT): image encoder + context encoder
- Hybrid feature fusion: concat(diffusion features, CNN features)
- RAFT-based estimation: correlation operator (C) + iterative update operator (U) → estimated flow f̂_{k→k+1}

Training pipeline
- Stage 1: L_diff; Stage 2: L_flow
- Pseudo ground truth: SEA-RAFT; dataset: YouHQ; degradation: Real-ESRGAN

Key insights
- Diffusion features encode degradation patterns
- Full spatio-temporal attention enables temporal reasoning
- Hybrid fusion combines robustness with spatial detail

Evaluation benchmarks: Sintel (Clean + Final), Spring, TartanAir

Notation: I_LQ^k, I_LQ^{k+1} = LQ frame pair; D_φ = diffusion model; MM-DiT = multi-modal DiT; Q, K = query/key features; DPT = feature upsampler; E_img, E_ctx = CNN encoders; C = correlation operator; U = update operator; L_diff = diffusion loss; L_flow = flow loss

Full pipeline: M_θ = U ∘ C ∘ (Up(D_φ), E) → degradation-aware optical flow estimation (lifted diffusion model + CNN encoder → hybrid fusion → correlation and iterative refinement)
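Training frames are degraded synthetically; the paper uses the Real-ESRGAN pipeline, while the toy stand-in below only shows the basic blur + noise + quantization pattern (Real-ESRGAN additionally models resizing and JPEG compression):

```python
import numpy as np

def box_blur(img, k=3):
    """Box blur with edge padding -- a toy stand-in for the blur
    kernels in a Real-ESRGAN-style degradation pipeline."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + H, dx:dx + W]
    return out / (k * k)

def degrade(frame, noise_std=0.05, rng=None):
    """Blur + additive Gaussian noise + 8-bit quantization: a minimal
    sketch of synthetic degradation for an image in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = box_blur(frame)
    x = x + rng.normal(scale=noise_std, size=x.shape)
    return np.clip(np.round(x * 255.0) / 255.0, 0.0, 1.0)

rng = np.random.default_rng(1)
clean = rng.uniform(size=(32, 32))
lq = degrade(clean, rng=rng)
```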
Q1
1. What new task does the DA-Flow paper introduce to address optical flow estimation in challenging conditions?
Video frame restoration using optical flow guidance
Degradation-Aware Optical Flow estimation
Zero-shot optical flow from diffusion features
Q2
2. Why did the authors choose to lift an image restoration diffusion model rather than using a video restoration diffusion model for their approach?
Video restoration models are too computationally expensive to train
Video restoration models compress frames into a shared latent, losing per-frame spatial structure needed for dense matching
Image restoration models have stronger generative priors than video models
Q3
3. In DA-Flow, which layers were selected based on the feature analysis for extracting diffusion features?
The first 4 layers (1, 2, 3, 4)
Layers {3, 13, 16, 17}
The last 4 layers (28, 29, 30, 31)