2025-07-07 Papers

Paper 1

IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction

Published: 2025-07-02

Link: http://arxiv.org/pdf/2507.02025

1. 📘 Topic and Domain: Development of IntFold, a controllable foundation model for biomolecular structure prediction in computational biology and drug discovery.
2. 💡 Previous Research and New Ideas: Builds on AlphaFold 3's architecture for biomolecular structure prediction, introducing controllable adapters for specialized tasks and a custom attention kernel.
3. ❓ Problem: Addresses the challenge of efficiently adapting large structure prediction models for specialized applications while maintaining high accuracy across general prediction tasks.
4. 🛠️ Methods: Implements modular adapters (LoRA architecture), custom FlashAttentionPairBias kernel, and a model-agnostic ranking method, trained on comprehensive datasets including PDB structures and specialized datasets.
5. 📊 Results and Evaluation: Achieves accuracy comparable to AlphaFold 3 across various biomolecular structures (protein-protein, protein-ligand, nucleic acids), with significant improvements in specialized tasks like allosteric state prediction and binding affinity estimation.
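The modular adapters mentioned above follow the LoRA recipe: the base weights stay frozen and only a low-rank update is trained. A minimal numpy sketch of one such layer (the class name and shapes are illustrative, not IntFold's actual code):

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer augmented with a trainable low-rank adapter:
    y = x @ W.T + (alpha / r) * x @ A.T @ B.T
    Only A and B are updated during fine-tuning; W stays frozen."""

    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.01               # trainable down-projection
        self.B = np.zeros((d_out, r))                                # trainable up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x):
        base = x @ self.W.T
        delta = (x @ self.A.T) @ self.B.T
        return base + self.scale * delta

layer = LoRALinear(d_in=64, d_out=64)
x = np.ones((2, 64))
# With B initialized to zero, the adapter is a no-op before training,
# so the output matches the frozen base model exactly.
assert np.allclose(layer(x), x @ layer.W.T)
```

The zero initialization of B is what makes adapter insertion safe: the specialized model starts out identical to the general one and only drifts where fine-tuning pushes it.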

Workflow overview (from the paper's summary figure):

• Data sources: Protein Data Bank (PDB), AlphaFold Database, disordered-protein PDB entries, antibody-antigen data, an affinity dataset, a CDK2 dataset, plus MSAs and templates.
• Core architecture: an embedding trunk (sequences + MSAs) feeding a diffusion block (structure generation) and a confidence head (pLDDT, pTM), with an optional modular adapter and the FlashAttentionPairBias kernel.
• Adapter types: per-layer LoRA adapters and a post-hoc downstream module.
• Specialized applications: target-specific modeling (CDK2), guided folding with constraints, and binding affinity prediction.
• Benchmark results — protein systems: monomer LDDT 0.88, protein-protein 72.9%, antibody-antigen 37.6% → 43.2%; protein-ligand: success rate 58.5% (IntFold+ 61.8%, PoseBusters 76.1%); nucleic acids: protein-DNA 74.1%, protein-RNA 58.9%, RNA LDDT 0.63. Compared against AlphaFold 3, Boltz-1/Boltz-2, Chai-1, HelixFold 3, and Protenix.
• Technical innovations: a custom attention kernel (faster than DeepSpeed/NVIDIA implementations), a training-free similarity-based ranking method, and a skip-and-recover stability mechanism.
• Training insights: activation explosion issues, gradient spike handling, parametrization choices, numerical considerations, float32 for the diffusion module.
• Final output: 3D biomolecular structures.
• Key achievements: matches AlphaFold 3 performance, outperforms contemporary methods, enables controllable specialized applications, and delivers superior attention kernel efficiency.
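The training-free ranking method listed above scores each candidate structure by its similarity to the other candidates and prefers the consensus. A toy sketch of that idea, using negative RMSD as the similarity measure (the paper's actual metric is not reproduced here):

```python
import numpy as np

def rank_by_consensus(predictions):
    """Given N candidate structures for one target, each an (L, 3) coordinate
    array, score each by its mean similarity (toy negative RMSD) to all the
    others and return candidate indices sorted best-first."""
    n = len(predictions)
    scores = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = predictions[i] - predictions[j]
            rmsd = np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))
            scores[i] -= rmsd  # higher (less negative) score = closer to consensus
    scores /= (n - 1)
    return np.argsort(-scores)  # best-first

rng = np.random.default_rng(0)
consensus = rng.standard_normal((10, 3))
# Four candidates near the consensus, one far-off outlier:
candidates = [consensus + 0.05 * rng.standard_normal((10, 3)) for _ in range(4)]
candidates.append(consensus + 5.0 * rng.standard_normal((10, 3)))
order = rank_by_consensus(candidates)
assert order[-1] == 4  # the outlier ranks last
```

Because the score depends only on the predictions themselves, the same ranking can be applied to outputs from any structure predictor, which is what makes the method model-agnostic.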
Q1
1. What is the key innovation that distinguishes IntFold from other biomolecular structure prediction models?
Its faster processing speed compared to AlphaFold 3
Its controllability through specialized adapters while keeping the base model frozen
Its ability to handle larger protein structures
Q2
2. Which technical innovation did IntFold introduce to improve computational efficiency?
A new data preprocessing pipeline
A quantum computing integration module
A custom FlashAttentionPairBias kernel for faster and more memory-efficient processing
Q3
3. How does IntFold handle the ranking of multiple structure predictions for a given target?
By using a deep learning confidence score
By selecting the structure with highest energy score
By using a training-free similarity-based method that compares multiple predictions
Paper 2

Ovis-U1 Technical Report

Published: 2025-06-28

Link: http://arxiv.org/pdf/2506.23044

1. 📘 Topic and Domain: A technical report introducing Ovis-U1, a 3-billion-parameter unified multimodal AI model for image understanding, text-to-image generation, and image editing.
2. 💡 Previous Research and New Ideas: Draws on GPT-4o-style unified modeling and previous Ovis models, proposing a unified training approach that starts from a language model rather than a frozen multimodal language model.
3. ❓ Problem: Addressing how to endow a multimodal understanding model with image generation capabilities and effectively train a unified model on both understanding and generation tasks.
4. 🛠️ Methods: Implements a diffusion-based visual decoder with bidirectional token refiner, utilizing a 6-stage unified training process combining understanding, generation, and editing tasks.
5. 📊 Results and Evaluation: Achieves 69.6 on OpenCompass Multi-modal Academic Benchmark, 83.72 on DPG-Bench, 0.89 on GenEval, and scores of 4.00 and 6.42 on ImgEdit-Bench and GEdit-Bench-EN respectively, surpassing several state-of-the-art models.

Architecture and training overview (from the paper's summary figure):

• Training pipeline: Stage 0 visual decoder pretraining (T2I generation) → Stage 1 adapter pretraining (understanding, T2I, editing) → Stage 2 visual encoder alignment (understanding, T2I, editing) → Stage 3 understanding learning (understanding only) → Stage 4 generation learning (T2I generation) → Stage 5 generation fine-tuning (T2I generation, editing).
• Architecture (3.6B total parameters): Qwen3-1.7B LLM (1,720M), AIMv2-Large visual encoder (578M), visual-text adapter (135M), bidirectional token refiner (81M), MMDiT visual decoder (1,046M).
• Training data: understanding (COYO, Wukong, ShareGPT4V), T2I generation (Laion-aes6, JourneyDB), image editing (OmniEdit, UltraEdit, SeedEdit).
• Headline results: understanding MMB 77.8, T2I generation GenEval 0.89, image editing ImgEdit 4.00.
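The per-component counts reported in the figure can be tallied to check the headline "3.6B" figure (a quick sanity-check sketch using only the numbers from the report):

```python
# Component parameter counts in millions, as reported for Ovis-U1.
components = {
    "LLM (Qwen3-1.7B)": 1720,
    "Visual encoder (AIMv2-Large)": 578,
    "Visual-text adapter": 135,
    "Bidirectional token refiner": 81,
    "Visual decoder (MMDiT)": 1046,
}
total_m = sum(components.values())
print(f"Total: {total_m / 1000:.2f}B parameters")  # → Total: 3.56B parameters
```

So the components sum to 3.56B, which the report rounds to 3.6B: roughly 1.7B in the LLM and about 1.9B across the visual components.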
Q1
1. What is the key innovation in Ovis-U1's training approach compared to previous models?
Using a frozen multimodal language model
Starting from a language model and training with unified tasks
Training only on image generation tasks
Q2
2. Which component is responsible for enhancing the interaction between textual and visual embeddings in Ovis-U1?
The visual encoder
The VAE decoder
The bidirectional token refiner
Q3
3. What is the total number of parameters in Ovis-U1 and how are they distributed?
3.6B parameters with 1.7B in LLM and 1.9B in other components
3B parameters evenly distributed across all components
3.6B parameters with 1B in visual decoder and 2.6B in other parts
Paper 3

Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search

Published: 2025-07-03

Link: http://arxiv.org/pdf/2507.02652

1. 📘 Topic and Domain: The paper presents HiRA, a hierarchical reasoning framework for deep search tasks in artificial intelligence that separates planning from execution.
2. 💡 Previous Research and New Ideas: Building on previous retrieval-augmented generation (RAG) and single-model reasoning approaches, it proposes a novel multi-agent hierarchical architecture that decouples high-level planning from specialized execution.
3. ❓ Problem: The paper addresses the limitations of current single-model approaches that struggle with handling both high-level planning and detailed execution simultaneously, leading to inefficient reasoning and limited scalability.
4. 🛠️ Methods: HiRA implements a three-tier architecture consisting of a Meta Reasoning Planner for task decomposition, an Adaptive Reasoning Coordinator for task delegation, and Domain-Specialized Executors for specialized task execution.
5. 📊 Results and Evaluation: Experiments on four complex cross-modal deep search benchmarks showed that HiRA significantly outperformed state-of-the-art RAG and agent-based systems, with notable improvements in both answer quality and system efficiency.

Framework overview (from the paper's summary figure):

• Meta Reasoning Planner: task decomposition, strategic planning, answer generation.
• Adaptive Reasoning Coordinator: task assignment, reasoning distillation, and memory management via a dual-channel memory system (fact memory for factual discoveries, resource memory for information resources).
• Domain-Specialized Executors: expert agents with tool integration and specialized reasoning — a search expert (simple RAG, deep search via WebThinker, web information acquisition), a code expert (Python interpreter, mathematical computation, file processing, data analysis), and a multimodal expert (image understanding, video analysis, audio processing, cross-modal fusion).
• External tools: Bing Search API, web browser, code sandbox, multimodal models.
• Inference process flow: (1) query input and task analysis → (2) subtask generation → (3) agent selection → (4) task execution → (5) result distillation → (6) answer generation, with iterative refinement throughout.
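The decoupling described above can be sketched as a small control loop: the planner only decomposes and synthesizes, the coordinator only delegates and manages memory, and executors only act. All class and method names below are illustrative stubs, not taken from the paper's code:

```python
class MetaPlanner:
    """Tier 1: decomposes the query and synthesizes the final answer.
    It never touches tools or raw search results."""
    def decompose(self, query):
        return [f"subtask for: {query}"]  # stubbed single-subtask plan

    def synthesize(self, query, memory):
        return f"answer to '{query}' from {len(memory)} finding(s)"

class Coordinator:
    """Tier 2: routes each subtask to an expert and distills results
    into shared memory (the paper's dual-channel memory is collapsed
    into one list here for brevity)."""
    def __init__(self, executors):
        self.executors = executors
        self.memory = []

    def delegate(self, subtask):
        expert = self.executors.get("search")  # trivial routing rule
        result = expert(subtask)
        self.memory.append(result)  # distilled finding, not raw tool output
        return result

def search_expert(subtask):
    """Tier 3: a domain-specialized executor (stubbed)."""
    return f"evidence for [{subtask}]"

planner = MetaPlanner()
coordinator = Coordinator({"search": search_expert})
query = "deep search question"
for sub in planner.decompose(query):
    coordinator.delegate(sub)
print(planner.synthesize(query, coordinator.memory))
```

The point of the structure is that execution details (tool calls, raw evidence) stay inside the executors and coordinator, so the planner's reasoning trace is never polluted by them, which is exactly the failure mode the paper attributes to single-model approaches.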
Q1
1. What is the main architectural innovation of HiRA compared to previous approaches?
It uses a single large language model for all tasks
It separates planning from execution using a three-tier architecture
It only focuses on web search capabilities
Q2
2. Why do traditional single-model approaches struggle with complex search tasks according to the paper?
They are too slow to process search results
They cannot handle multiple languages
They mix execution details with high-level reasoning, disrupting the core reasoning process
Q3
3. What component in HiRA is responsible for matching subtasks with the most appropriate expert agents?
The Meta Reasoning Planner
The Adaptive Reasoning Coordinator
The Domain-Specialized Executors