1. 📘 Topic and Domain: The paper presents Intern-S1-Pro, a one-trillion-parameter scientific multimodal foundation model designed for AI for Science (AI4S), covering chemistry, materials, life sciences, earth sciences, and general reasoning.
2. 💡 Previous Research and New Ideas: Building on prior work in LLMs, VLMs, Mixture-of-Experts (MoE) architectures, and scientific AI, the paper introduces expert expansion with grouped routing to balance expert load in large MoE models, a Straight-Through Estimator (STE) for router optimization, Fourier Position Encoding (FoPE) for physical signals, a dedicated time-series encoder with adaptive subsampling, and a specialized scientific-caption pipeline for high-quality image-text alignment.
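The grouped-routing idea can be illustrated with a minimal sketch: partitioning the experts into groups and giving each group a fixed per-token quota guarantees that no group is starved or overloaded, which is the "balanced expert load" property described above. This is a hypothetical illustration — the function name and the exact selection rule are assumptions, and Intern-S1-Pro's actual routing scheme may differ in its details.

```python
import numpy as np

def grouped_topk_route(logits, n_groups, k_per_group):
    """Route one token under grouped routing (illustrative sketch).

    The experts are split into `n_groups` equal-sized groups, and the
    token activates the `k_per_group` highest-scoring experts *inside
    each group*. Because every group contributes exactly the same
    number of activated experts, the load is balanced by construction.
    """
    n_experts = logits.shape[0]
    group_size = n_experts // n_groups
    chosen = []
    for g in range(n_groups):
        # Scores for the experts belonging to group g.
        seg = logits[g * group_size:(g + 1) * group_size]
        # Indices (within the group) of the top-k experts.
        top = np.argsort(seg)[-k_per_group:]
        # Convert back to global expert indices.
        chosen.extend(int(g * group_size + t) for t in top)
    return sorted(chosen)

# Example: 8 experts, 2 groups, 1 expert per group.
# Experts 0-3 form group 0, experts 4-7 form group 1; with these
# logits the best expert in each group is selected.
print(grouped_topk_route(np.arange(8.0), n_groups=2, k_per_group=1))
```

Note the contrast with plain global top-k routing, where a few "popular" experts can absorb most of the traffic; the per-group quota removes that failure mode at the cost of sometimes skipping a globally higher-scoring expert.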
3. ❓ Problem: The paper addresses the challenge of scaling a scientific multimodal foundation model to one trillion parameters while maintaining training stability and efficiency: it tackles expert load imbalance in MoE models and improves scientific visual understanding across more than 100 specialized tasks.
4. 🛠️ Methods: The methods include expert expansion from Intern-S1 with grouped routing for absolute load balancing, a Straight-Through Estimator for router gradient estimation, a Native ViT encoder for vision, FoPE for positional encoding, a dynamic time-series encoder with adaptive subsampling, a scientific caption pipeline (MinerU + CapRL/InternVL3.5), and stable mixed-precision RL training (FP8 with BF16/FP32 precision handling) built on the XTuner and LMDeploy infrastructure.
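The Straight-Through Estimator mentioned among the methods addresses a standard problem: a router's hard top-1/top-k selection (an argmax) has zero gradient, so the router cannot be trained directly. The usual straight-through pattern keeps the hard decision in the forward pass but lets gradients flow through the soft routing probabilities in the backward pass. The sketch below shows that pattern in NumPy; the variable names are illustrative, and the paper's router-specific STE variant may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# One token's router logits over 4 experts.
logits = np.array([1.2, 0.3, 2.1, -0.5])
probs = softmax(logits)

# Forward pass: hard one-hot routing decision (non-differentiable argmax).
hard = np.zeros_like(logits)
hard[np.argmax(logits)] = 1.0

# Straight-through trick, as written in an autograd framework:
#     y = hard + (probs - stop_gradient(probs))
# Forward value: exactly `hard` (the two `probs` terms cancel).
# Backward pass: d y / d logits = d probs / d logits, so the router
# receives a usable gradient despite the discrete selection.
y = hard + (probs - probs)  # stop_gradient is a no-op in plain NumPy
```

In a real framework the `stop_gradient` (e.g. a detach) is what makes the trick work: the cancellation holds only for the forward value, not for the gradient, which is exactly the biased-but-useful estimator the summary refers to.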
5. 📊 Results and Evaluation: Evaluated on scientific benchmarks (SciReasoner, SFE, SmolInstruct, MatBench, Mol-Instructions, etc.) and general benchmarks (MMMU-Pro, MMLU-Pro, AIME-2025, GAIA, etc.), Intern-S1-Pro outperforms proprietary models like Gemini-3-Pro and GPT-5.2 on scientific tasks (e.g., SciReasoner 55.5 vs. 14.7 for Gemini-3-Pro) and achieves top-tier open-source performance, with strong time-series understanding on SciTS benchmarks.