2026-03-27 Papers


Paper 1

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Published: 2026-03-26

Link: http://arxiv.org/pdf/2603.25040

1. 📘 Topic and Domain: The paper presents Intern-S1-Pro, a one-trillion-parameter scientific multimodal foundation model designed for AI for Science (AI4S), covering chemistry, materials, life sciences, earth sciences, and general reasoning.
2. 💡 Previous Research and New Ideas: Based on prior work in LLMs, VLMs, MoE architectures, and scientific AI, the paper introduces expert expansion with grouped routing to balance expert load in large MoE models, a Straight-Through Estimator for router optimization, Fourier Position Encoding (FoPE) for physical signals, a dedicated time-series encoder with adaptive subsampling, and a specialized scientific caption pipeline for high-quality image-text alignment.
3. ❓ Problem: The paper aims to solve the challenge of scaling a scientific multimodal foundation model to one trillion parameters while maintaining training stability and efficiency, addressing expert load imbalance in MoE models and improving scientific visual understanding across over 100 specialized tasks.
4. 🛠️ Methods: The methods include expert expansion from Intern-S1 using grouped routing for absolute load balancing, Straight-Through Estimator for router gradient estimation, a Native ViT encoder for vision, FoPE for positional encoding, a dynamic time-series encoder with adaptive subsampling, a scientific caption pipeline (MinerU + CapRL/InternVL3.5), and stable mixed-precision RL training (FP8 with BF16/FP32 precision handling) using XTuner and LMDeploy infrastructure.
5. 📊 Results and Evaluation: Evaluated on scientific benchmarks (SciReasoner, SFE, SmolInstruct, MatBench, Mol-Instructions, etc.) and general benchmarks (MMMU-Pro, MMLU-Pro, AIME-2025, GAIA, etc.), Intern-S1-Pro outperforms proprietary models like Gemini-3-Pro and GPT-5.2 on scientific tasks (e.g., SciReasoner 55.5 vs. 14.7 for Gemini-3-Pro) and achieves top-tier open-source performance, with strong time-series understanding on SciTS benchmarks.
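The grouped-routing and Straight-Through-Estimator ideas in items 2 and 4 can be illustrated with a toy sketch. All shapes, group counts, and function names below are illustrative assumptions, not the paper's implementation, and the STE itself (which needs an autograd framework) is only noted in a comment:

```python
import numpy as np

def grouped_topk_routing(logits, n_groups, k_per_group):
    """Grouped routing, sketched: experts are split into equal groups
    (e.g. one group per device) and the router picks the top-k experts
    *within each group*.  Every group therefore receives exactly
    n_tokens * k_per_group assignments -- absolute load balance by
    construction.  (A Straight-Through Estimator would keep this hard
    top-k in the forward pass while letting gradients flow through the
    soft router scores in the backward pass.)"""
    n_tokens, n_experts = logits.shape
    group_size = n_experts // n_groups
    grouped = logits.reshape(n_tokens, n_groups, group_size)

    # hard top-k inside each group (local ids -> global expert ids)
    local = np.argsort(-grouped, axis=-1)[..., :k_per_group]
    expert_ids = local + np.arange(n_groups)[None, :, None] * group_size
    expert_ids = expert_ids.reshape(n_tokens, -1)

    # softmax gates over the selected experts only
    picked = np.take_along_axis(logits, expert_ids, axis=1)
    gates = np.exp(picked - picked.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)
    return expert_ids, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 16))        # 128 tokens, 16 experts
ids, gates = grouped_topk_routing(logits, n_groups=4, k_per_group=2)

# each of the 4 groups (group_size = 4) gets exactly 128 * 2 assignments
per_group = np.bincount(ids.ravel() // 4, minlength=4)
```

Because the top-k is taken per group rather than globally, no group can be starved or overloaded, which is exactly the device-level balance the paper targets.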


Intern-S1-Pro workflow architecture (recovered figure text)

Architecture
- Expert expansion: Intern-S1 → Intern-S1-Pro (1T params)
- Grouped routing mechanism: load balancing, training stability
- Straight-Through Estimator: dense gradient for router optimization
- Native Vision Transformer (ViT): native resolution, 300M image-text pairs
- Fourier Position Encoding (FoPE): wave-particle duality modeling
- Time-series encoder: adaptive subsampling

Pre-training (6T tokens)
- Scientific caption pipeline: PDF extraction (MinerU 2.5), content deduplication (pHash), captioning (CapRL + InternVL3.5), quality filtering (0.5B model), topic clustering → 270B scientific tokens
- Data conflict resolution: structured scientific data transform, scientific data diversification, system prompt isolation → stable multimodal training
- General + scientific data fusion: image-text pairs and text data, 6T tokens total

Post-training (RL at 1T scale)
- Stable mixed-precision RL: operator-level precision alignment, rollout router replay for consistency, FP8 for expert MLPs with BF16 elsewhere, FP32 LM head for numerical fidelity, importance sampling with masking → train-inference consistency
- Infrastructure co-design: XTuner + LMDeploy, 4× scale at ~20% efficiency loss
- Agentic RL training: multi-step planning and execution

Capabilities and evaluation
- Scientific tasks (100+ domains): SciReasoner 55.5 (scientific reasoning), SmolInstruct 74.8 (chemistry), MatBench 72.8 (materials science), Mol-Instructions 48.8 (biomolecular), MicroVQA 63.3 (microscopy), MSEarth 65.2 (earth science), SciTS time series (EAU01: 99.5 F1), Biology-Instruction 52.5 (multi-omics)
- General capabilities: MMMU-Pro 72.8 (knowledge and reasoning), MMLU-Pro 86.6 (multi-task language), AIME-2025 93.1 (math olympiad), RefCOCO 91.9 (visual grounding), IFBench 71.2 (instruction following), OCRBench V2 60.1 (OCR), SArena 83.5 (SVG generation), LCB V6 74.3 (code generation)
- Agent capabilities: GAIA 77.4 (real-world task solving), Tau2-Bench 80.9 (conversational AI), ScreenSpot V2 93.6 (GUI grounding); multi-step planning, autonomous scientific workflows, tool use and reasoning, environmental grounding, web search integration
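The adaptive subsampling used by the time-series encoder can be sketched as follows. This is a hypothetical minimal version; `max_len` and the strided policy are illustrative assumptions, not the paper's encoder:

```python
import numpy as np

def adaptive_subsample(series, max_len=512):
    """Adaptive subsampling sketch: signals longer than the token
    budget are strided down to fit, short signals pass through
    unchanged.  The stride adapts to the input length, so very
    different sampling rates are absorbed before encoding.
    `max_len` is an illustrative budget, not a value from the paper."""
    n = len(series)
    if n <= max_len:
        return series
    stride = -(-n // max_len)          # ceil(n / max_len)
    return series[::stride]

short = adaptive_subsample(np.arange(100))     # fits the budget, untouched
long_ = adaptive_subsample(np.arange(10_000))  # strided down to <= 512 steps
```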
Q1
1. What is the primary challenge that the Grouped Routing mechanism in Intern-S1-Pro addresses?
Reducing the total number of parameters in the model
Achieving absolute load balancing across devices in expert parallelism training
Improving the vision encoder's accuracy on natural images
Q2
2. Why was a specialized caption pipeline developed for scientific images instead of using existing web caption datasets?
Because web caption datasets were too large and caused memory issues
Because existing captions often have limited image-text alignment and lack the detail required for scientific visual content
Because the model needed captions in Chinese instead of English
Q3
3. On the SciReasoner benchmark, how did Intern-S1-Pro compare to Gemini-3-Pro?
Intern-S1-Pro scored slightly lower at 50.2 vs Gemini-3-Pro's 52.1
Intern-S1-Pro significantly outperformed Gemini-3-Pro with 55.5 vs 14.7
Both models achieved identical scores of 40.0

Paper 2

Voxtral TTS

Published: 2026-03-26

Link: http://arxiv.org/pdf/2603.25551

1. 📘 Topic and Domain: The paper focuses on expressive multilingual text-to-speech (TTS) synthesis, specifically zero-shot voice cloning and natural speech generation.
2. 💡 Previous Research and New Ideas: Based on prior zero-shot TTS systems using discrete speech tokens and diffusion/flow-based acoustic modeling, the paper proposes a hybrid architecture combining auto-regressive semantic token generation with flow-matching for acoustic tokens, and introduces the Voxtral Codec with ASR-distilled semantic VQ and FSQ acoustic quantization.
3. ❓ Problem: The paper aims to solve the challenge of generating natural, expressive, and speaker-similar speech from very short reference audio (as little as 3 seconds) in a multilingual zero-shot setting, moving beyond autoregressive acoustic modeling.
4. 🛠️ Methods: The method uses a hybrid architecture: an auto-regressive decoder (Ministral 3B) for semantic tokens, a flow-matching transformer for acoustic tokens, and the Voxtral Codec for tokenization; training involves pretraining on pseudo-labeled audio-text pairs and Direct Preference Optimization (DPO) adapted for the hybrid discrete-continuous setting.
5. 📊 Results and Evaluation: Results show Voxtral TTS achieves a 68.4% win rate over ElevenLabs Flash v2.5 in human evaluations for voice cloning, with strong speaker similarity and intelligibility (WER) across 9 languages, and efficient inference via CUDA graph acceleration and asynchronous chunked streaming in vLLM-Omni.
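The flow-matching half of the hybrid objective can be sketched in its generic conditional form. This is the textbook linear-interpolation formulation, not necessarily Voxtral's exact parameterization; `velocity_fn` is a stand-in for the flow-matching transformer:

```python
import numpy as np

def flow_matching_loss(x0, x1, velocity_fn, rng):
    """Generic conditional flow-matching objective: sample a point on
    the straight line between noise x0 and data x1, then regress the
    model's velocity field onto the constant target velocity x1 - x0."""
    t = rng.uniform(size=(x0.shape[0], 1))   # one time per sample
    x_t = (1.0 - t) * x0 + t * x1            # linear interpolation path
    target_v = x1 - x0                       # ground-truth velocity
    pred_v = velocity_fn(x_t, t)
    return float(np.mean((pred_v - target_v) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 8))                # noise samples
x1 = rng.normal(size=(64, 8))                # stand-in "acoustic" targets

# an oracle that already outputs the true velocity drives the loss to zero
oracle = lambda x_t, t: x1 - x0
loss = flow_matching_loss(x0, x1, oracle, rng)
```

At inference, the learned velocity field is integrated over a few steps (the figure text mentions 8 NFEs with classifier-free guidance) rather than sampled autoregressively, which is what lets the acoustic stage stay fast.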


Voxtral TTS workflow architecture (recovered figure text)

Inputs
- Voice reference: 3-30 seconds, 24 kHz mono
- Text prompt: tokenized input

Voxtral Codec encoder
- Patchification → transformer blocks → causal CNNs → quantization
- VQ for semantic tokens (256-dim), FSQ for acoustic tokens (36×21), 2.14 kbps total
- Audio tokens: 37 per frame at 12.5 Hz (1 semantic + 36 acoustic)

Generation pipeline
- Decoder backbone (Ministral 3B): auto-regressive semantic token generation
- Flow-matching transformer: continuous space, 8 NFEs + CFG
- Hidden state h → linear head → semantic + acoustic tokens
- Voxtral Codec decoder: transposed CNNs + transformer blocks → 24 kHz waveform

Training pipeline
- Pretraining: paired audio and transcripts, (A₁, T₂, A₂) tuples, cross-entropy loss + flow-matching loss
- DPO fine-tuning: rejection sampling, semantic DPO (β=0.1), acoustic DPO (β=0.5), combined with the pretraining loss

vLLM-Omni serving
- CUDA graph acceleration: 47% latency reduction
- Asynchronous chunked streaming: 70 ms latency, 0.103 RTF

Key innovations: hybrid AR semantic generation + flow-matching acoustic generation, hybrid VQ-FSQ quantization, ASR-distilled semantic tokens, DPO preference alignment, CUDA graph optimization; 9 languages supported; 68.4% win rate vs ElevenLabs Flash v2.5.
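The DPO fine-tuning step (semantic β=0.1, acoustic β=0.5) builds on the standard DPO objective. A minimal sketch of that generic loss, without the paper's discrete/continuous adaptation:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO objective in its generic form.  logp_* are sequence
    log-probabilities of the chosen (w) and rejected (l) samples under
    the policy and a frozen reference model; beta scales the implicit
    reward margin."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))  # -log sigmoid(margin)

# policy prefers the chosen sample more than the reference does -> loss < log 2
loss_better = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
# no preference shift relative to the reference -> loss is exactly log 2
loss_neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0, beta=0.1)
```

The two β values in the figure suggest the acoustic head is pushed harder toward the preferred samples than the semantic one, though the exact adaptation to the flow-matched continuous tokens is specific to the paper.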
Q1
1. What quantization scheme does Voxtral Codec use for its semantic and acoustic tokens?
VQ for semantic tokens and FSQ for acoustic tokens
RVQ for semantic tokens and VQ for acoustic tokens
FSQ for both semantic and acoustic tokens
Q2
2. What was Voxtral TTS's win rate over ElevenLabs Flash v2.5 in human evaluations for multilingual zero-shot voice cloning?
58.3%
72.4%
68.4%
Q3
3. Which component did the authors use as the decoder backbone for Voxtral TTS?
Mistral Small
Ministral 3B
Whisper encoder

Paper 3

DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Published: 2026-03-24

Link: http://arxiv.org/pdf/2603.23499

1. 📘 Topic and Domain: The paper focuses on optical flow estimation in computer vision, specifically addressing the challenge of accurately estimating motion in real-world videos suffering from degradations like blur, noise, and compression artifacts.
2. 💡 Previous Research and New Ideas: The paper builds upon RAFT-based optical flow methods and pretrained image restoration diffusion models (e.g., DiT4SR); it introduces the novel task of "Degradation-Aware Optical Flow" and proposes lifting an image restoration diffusion model to video via full spatio-temporal attention to create a hybrid architecture that fuses diffusion and CNN features.
3. ❓ Problem: It solves the problem of severe performance degradation experienced by existing optical flow models when applied to corrupted real-world inputs, where texture and motion boundaries are obscured.
4. 🛠️ Methods: The authors lift a DiT-based image restoration model to video using full spatio-temporal cross-frame attention, select the best layers for correspondence, upsample these features using DPT heads, and fuse them with conventional CNN features in a RAFT-based iterative refinement framework, training with pseudo ground-truth flow from high-quality videos.
5. 📊 Results and Evaluation: DA-Flow substantially outperforms state-of-the-art methods (RAFT, SEA-RAFT, FlowSeek) on Sintel, Spring, and TartanAir benchmarks under synthetic degradations, demonstrating lower End-Point Error (EPE) and fewer outlier pixels, validated through quantitative metrics and qualitative visual comparisons.
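The correlation operator at the heart of the RAFT-style refinement computes an all-pairs similarity volume between the two frames' feature maps. A generic sketch (not DA-Flow's implementation; shapes and the shift test are illustrative):

```python
import numpy as np

def all_pairs_correlation(f1, f2):
    """All-pairs correlation volume as used in RAFT-style estimators.
    f1, f2: feature maps of shape (H, W, D); the result C[i, j, k, l]
    is the dot product between the feature at (i, j) in frame 1 and
    (k, l) in frame 2, scaled by 1/sqrt(D)."""
    H, W, D = f1.shape
    return np.einsum("ijd,kld->ijkl", f1, f2) / np.sqrt(D)

rng = np.random.default_rng(0)
f1 = rng.normal(size=(8, 8, 64))
f2 = np.roll(f1, shift=2, axis=1)   # frame 2 = frame 1 shifted 2 px right

corr = all_pairs_correlation(f1, f2)
# the best match for pixel (4, 3) should sit at its shifted location (4, 5)
best = np.unravel_index(np.argmax(corr[4, 3]), (8, 8))
```

Degradations blur exactly the feature distinctiveness this volume relies on, which is why the paper fuses degradation-robust diffusion features into the maps before correlating.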


DA-Flow method workflow (recovered figure text)

Pipeline
- Input: low-quality frame pair I_LQ^k, I_LQ^{k+1}
- Lifted image restoration diffusion model (D_φ): DiT4SR backbone, full spatio-temporal cross-frame attention in MM-DiT layers
- Feature extraction: query/key (Q, K) features → DPT-based upsampling; context features
- CNN encoder (RAFT): image encoder + context encoder
- Hybrid feature fusion: concat(diffusion features, CNN features)
- RAFT-based estimation: correlation operator (C) + iterative update operator (U) → estimated flow f̂_{k→k+1}

Training pipeline
- Stage 1: L_diff; Stage 2: L_flow
- Pseudo ground truth: SEA-RAFT; dataset: YouHQ; degradation: Real-ESRGAN

Key insights
- Diffusion features encode degradation patterns
- Full spatio-temporal attention enables temporal reasoning
- Hybrid fusion combines robustness with spatial detail

Evaluation benchmarks: Sintel (Clean + Final), Spring, TartanAir

Notation: I_LQ^k, I_LQ^{k+1} = LQ frame pair; D_φ = diffusion model; MM-DiT = multi-modal DiT; Q, K = query/key features; DPT = feature upsampler; E_img, E_ctx = CNN encoders; C = correlation operator; U = update operator; L_diff = diffusion loss; L_flow = flow loss

Full pipeline: M_θ = U ∘ C ∘ (Up(D_φ), E) → degradation-aware optical flow estimation (lifted diffusion model + CNN encoder → hybrid fusion → correlation and iterative refinement)
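Training frames are degraded synthetically; the paper uses the Real-ESRGAN pipeline, while the toy stand-in below only shows the basic blur + noise + quantization pattern (Real-ESRGAN additionally models resizing and JPEG compression):

```python
import numpy as np

def box_blur(img, k=3):
    """Box blur with edge padding -- a toy stand-in for the blur
    kernels in a Real-ESRGAN-style degradation pipeline."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + H, dx:dx + W]
    return out / (k * k)

def degrade(frame, noise_std=0.05, rng=None):
    """Blur + additive Gaussian noise + 8-bit quantization: a minimal
    sketch of synthetic degradation for an image in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = box_blur(frame)
    x = x + rng.normal(scale=noise_std, size=x.shape)
    return np.clip(np.round(x * 255.0) / 255.0, 0.0, 1.0)

rng = np.random.default_rng(1)
clean = rng.uniform(size=(32, 32))
lq = degrade(clean, rng=rng)
```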
Q1
1. What new task does the DA-Flow paper introduce to address optical flow estimation in challenging conditions?
Video frame restoration using optical flow guidance
Degradation-Aware Optical Flow estimation
Zero-shot optical flow from diffusion features
Q2
2. Why did the authors choose to lift an image restoration diffusion model rather than using a video restoration diffusion model for their approach?
Video restoration models are too computationally expensive to train
Video restoration models compress frames into a shared latent, losing per-frame spatial structure needed for dense matching
Image restoration models have stronger generative priors than video models
Q3
3. In DA-Flow, which layers were selected based on the feature analysis for extracting diffusion features?
The first 4 layers (1, 2, 3, 4)
Layers {3, 13, 16, 17}
The last 4 layers (28, 29, 30, 31)