2026-03-31 Papers

Paper 1

Towards a Medical AI Scientist

Published: 2026-03-30

Link: http://arxiv.org/pdf/2603.28589

1. 📘 Topic and Domain: The paper introduces Medical AI Scientist, an autonomous framework for end-to-end clinical medical AI research automation spanning hypothesis generation, experimental validation, and manuscript drafting.
2. 💡 Previous Research and New Ideas: Based on existing AI Scientist systems (AI Scientist-v2, AI-Researcher, Agent Laboratory) and medical AI applications, the paper proposes a novel clinician-engineer co-reasoning mechanism and three operational modes (Reproduction, Innovation, Exploration) tailored specifically for clinical medicine.
3. ❓ Problem: Existing AI Scientists are domain-agnostic, lacking mechanisms to ground hypotheses in medical evidence, handle heterogeneous clinical data formats, or ensure ethical compliance, making them unsuitable for clinical autonomous research.
4. 🛠️ Methods: The framework comprises three components: an Idea Proposer with clinician-engineer co-reasoning for evidence-grounded hypothesis generation, an Experimental Executor orchestrating domain-specific medical toolboxes in Dockerized environments, and a hierarchical Manuscript Composer enforcing structured medical writing with ethical review.
5. 📊 Results and Evaluation: Across 171 evaluation cases (19 tasks, 6 modalities), the system outperforms GPT-5 and Gemini-2.5-Pro in idea quality across six dimensions; achieves 86-93% experimental success rates; and generates manuscripts scoring 4.60±0.56 (Stanford Agentic Reviewer), competitive with MICCAI/ISBI/BIBM publications under double-blind human evaluation, with one manuscript accepted at ICAIS 2025.
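The three-component loop described above can be sketched schematically. Everything below (class and function names, stub logic, the placeholder AUC metric) is hypothetical scaffolding for illustration, not the paper's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical data carrier; the paper's structured task representation
# is far richer than this.
@dataclass
class ResearchPlan:
    hypothesis: str
    evidence: list   # literature citations grounding the hypothesis
    mode: str        # "reproduction" | "innovation" | "exploration"

def propose(task: str, mode: str) -> ResearchPlan:
    # Idea Proposer: clinician and engineer "roles" co-reason so the
    # hypothesis is both clinically relevant and technically feasible.
    clinician_view = f"clinical utility of {task}"
    engineer_view = "a reproducible, Dockerized training pipeline"
    return ResearchPlan(
        hypothesis=f"{clinician_view}, validated via {engineer_view}",
        evidence=["placeholder-citation"],
        mode=mode,
    )

def execute(plan: ResearchPlan) -> dict:
    # Experimental Executor: assemble the codebase, run training and
    # validation, judge consistency; reduced here to a results record.
    return {"plan": plan, "metrics": {"auc": 0.91}}

def compose(results: dict) -> str:
    # Manuscript Composer: evidence-based writing with an ethics check.
    plan = results["plan"]
    return f"[{plan.mode}] {plan.hypothesis} | AUC={results['metrics']['auc']}"

draft = compose(execute(propose("tumor segmentation", "exploration")))
```

The point of the sketch is the dataflow: each stage consumes the previous stage's structured output, which is what makes the reflect-and-refine cycle possible.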

Towards a Medical AI Scientist

[Workflow diagram] Three research modes: Paper-based Reproduction, Literature-inspired Innovation, and Task-driven Exploration, each starting from task instructions and a dataset (± reference papers). The Idea Proposer (Analyzer, Explorer, Preparer & Surveyor, co-reasoning Generator, Assessor) formalizes the medical task, grounds hypotheses in synthesized literature, and outputs a research plan. The Experimental Executor (Investigator, Planner, Executor, Judger, Analyst) assembles domain-specific toolboxes, runs training and validation pipelines in a Dockerized environment, and applies a reflect-and-refine cycle with corrective feedback. The Manuscript Composer (Content Generator, Ethics Reviewer, Narrative Enhancer, Reference Resolver) produces a publication-ready manuscript. Evaluation uses Med-AI Bench: 171 cases spanning 19 clinical tasks and 6 data modalities (image, video, EHR, text, signal, multimodal).
Q1
1. Which of the following is NOT one of the three research modes supported by the Medical AI Scientist framework?
Paper-based Reproduction
Hypothesis-driven Verification
Task-driven Exploration
Q2
2. What is the core innovation in the Idea Proposer that helps reduce hallucinations during hypothesis generation?
Multi-agent debate system
Clinician-engineer co-reasoning mechanism
Bayesian optimization framework
Q3
3. According to the paper, what success rate did the Medical AI Scientist achieve in the Task-driven Exploration mode during code execution experiments?
86%
93%
91%

Paper 2

GEditBench v2: A Human-Aligned Benchmark for General Image Editing

Published: 2026-03-30

Link: http://arxiv.org/pdf/2603.28547

1. 📘 Topic and Domain: The paper focuses on the evaluation of general image editing models, specifically addressing visual consistency assessment, and belongs to the domain of computer vision and multimodal AI.
2. 💡 Previous Research and New Ideas: The paper builds on existing VLM-as-a-Judge evaluation paradigms but introduces two novel contributions: (1) GEditBench v2, a benchmark with 23 diverse tasks including an open-set category for real-world unconstrained editing, and (2) PVC-Judge, an open-source pairwise assessment model for visual consistency trained via two region-decoupled preference data synthesis pipelines.
3. ❓ Problem: The paper addresses the limitation that existing image editing benchmarks suffer from narrow task coverage and that current evaluation metrics fail to adequately capture visual consistency (preservation of identity, structure, and semantic coherence between edited and original images).
4. 🛠️ Methods: The authors develop GEditBench v2 with 1,200 real-world queries spanning 23 tasks, create PVC-Judge by fine-tuning Qwen3-VL-8B-Instruct with ~128k preference pairs constructed using object-centric, human-centric, and VLM-as-a-Judge pipelines, and establish VCReward-Bench with 3,506 expert-annotated preference pairs for meta-evaluation.
5. 📊 Results and Evaluation: Experiments show that PVC-Judge achieves state-of-the-art performance among open-source evaluation models, with an average accuracy of 81.82%, even surpassing GPT-5.1 (76.89%); evaluating 16 frontier editing models on GEditBench v2 yields Overall Elo scores with a strong Spearman rank correlation to human-annotated Arena rankings (ρ=0.929), confirming that the evaluation system is closely aligned with human preferences.
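The object-centric pipeline's "Z-score + Pareto filtering" step, which selects preference-pair candidates that are good on every region-specific metric at once, can be illustrated with a small NumPy sketch; the scores and the dominance convention (higher is better) are invented for illustration:

```python
import numpy as np

def pareto_front(scores: np.ndarray) -> np.ndarray:
    """Boolean mask of Pareto-optimal rows (higher is better on every
    column). scores has shape (n_candidates, n_metrics)."""
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(scores, i, axis=0)
        # Row i is dominated if some other row is >= on all metrics
        # and strictly > on at least one.
        dominated = np.any(
            np.all(others >= scores[i], axis=1)
            & np.any(others > scores[i], axis=1)
        )
        keep[i] = not dominated
    return keep

# Toy region-specific metric scores for 4 candidate edits (2 metrics).
raw = np.array([[0.9, 0.2],
                [0.8, 0.8],
                [0.3, 0.9],
                [0.2, 0.1]])
# Z-score each metric so no single metric's scale dominates comparisons.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
mask = pareto_front(z)
```

Here the last candidate is dominated (worse than the second on both metrics) and is filtered out, while the three Pareto-optimal candidates survive; Z-scoring is monotone per column, so it changes scales without changing dominance relations.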

GEditBench v2: A Human-Aligned Benchmark for General Image Editing

[Workflow diagram] GEditBench v2 construction: 23 task categories (12 local, 6 global, 3 reference, 1 hybrid) plus an open-set of 100 real-world queries collected from Reddit and X with expert filtering and privacy protection; public images plus Nano Banana Pro generation yield 1,200 testing examples. PVC-Judge development: (1) candidate image generation — prompts curated from Pico-Banana-400K, Nano-Consistency-150K, and UnicEdit-10M, K-Center greedy selection (N=1,500 samples per task via Qwen3-VL-Embedding), and ~180k images from 7 models (BAGEL, Kontext, Step1X-Edit, Qwen-Image-Edit series); (2) ~128k preference pairs built with an object-centric pipeline (region decoupling, region-specific metrics, Z-score + Pareto filtering), a human-centric pipeline (face ID, body, hair; expert models such as ArcFace; dynamic attribute exclusion), and VLM-as-a-Judge with Gemini 3 Pro for global tasks (background change, style/tone transfer, enhancement; P=6 in-group pairs per sample); (3) training — LoRA (r=64, α=128) fine-tuning of Qwen3-VL-8B-Instruct with AdamW, LR=2e-6, 3 epochs, batch size 16, on 8×L40S GPUs. VCReward-Bench: 3,506 expert-annotated preference pairs over 21 predefined tasks and 8 editing models (including proprietary), with strict Pareto filtering for visual-consistency preference. Evaluation pipeline: instruction following and visual quality judged pairwise by GPT-4o, visual consistency by PVC-Judge; a Bradley-Terry model converts pairwise outcomes to Elo ratings, with 95% CIs from 1,000 bootstrap iterations. Key results: PVC-Judge is state-of-the-art among open-source judges (81.82% average accuracy vs 76.89% for GPT-5.1), outperforming EditScore, EditReward, and Qwen3-VL-8B; among 16 models, the top three are Nano Banana Pro, GPT-Image-1.5, and Seedream 4.5, with FLUX.2 [klein] 9B (4-step distilled) the best open-source; Elo scores correlate with Arena rankings at Spearman ρ=0.929; pairwise evaluation aligns with humans better than pointwise, and there is a trade-off between instruction following and visual consistency.
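The ranking step (a Bradley-Terry model mapped to Elo-style ratings) can be sketched with the classic minorization-maximization update; the toy win matrix is invented, and the paper's 95% confidence intervals from 1,000 bootstrap iterations are omitted here for brevity:

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 500) -> np.ndarray:
    """Fit Bradley-Terry strengths p_i from wins[i, j] = number of
    pairwise wins of model i over model j, via the MM update
    p_i <- W_i / sum_j n_ij / (p_i + p_j)."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T            # n_ij: total comparisons of i vs j
    total_wins = wins.sum(axis=1)    # W_i
    for _ in range(iters):
        new_p = np.empty(n)
        for i in range(n):
            denom = (games[i] / (p[i] + p)).sum()
            new_p[i] = total_wins[i] / denom
        p = new_p / new_p.sum()      # normalize to fix the arbitrary scale
    return p

def to_elo(p: np.ndarray, base: float = 1000.0, scale: float = 400.0) -> np.ndarray:
    # Elo-style mapping: a 400-point gap corresponds to 10:1 odds.
    return base + scale * np.log10(p / p.mean())

# Toy pairwise win counts among three editing models.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
elo = to_elo(fit_bradley_terry(wins))
```

The model that wins most of its pairwise comparisons ends up with the highest Elo rating, and because only rating *differences* are identified, the normalization and base value are free choices.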
Q1
1. What is a unique feature of GEditBench v2 compared to previous image editing benchmarks?
It uses pointwise scoring instead of pairwise comparison
It includes an open-set category for unconstrained real-world editing instructions
It only evaluates closed-set predefined tasks without any new categories
Q2
2. How was PVC-Judge trained to achieve human-aligned pairwise evaluation for visual consistency?
Using a single unified metric across all editing tasks
Via region-decoupled preference data synthesis pipelines (object-centric and human-centric)
By collecting human ratings exclusively through crowdsourcing platforms
Q3
3. According to experiments on VCReward-Bench, what did PVC-Judge achieve compared to GPT-5.1?
It performed significantly worse with an average accuracy of 65%
It achieved state-of-the-art performance among open-source models, even surpassing GPT-5.1 with 81.82 vs 76.89 average accuracy
It only worked well for simple local editing tasks like subject removal

Paper 3

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Published: 2026-03-30

Link: http://arxiv.org/pdf/2603.28610

1. 📘 Topic and Domain: Adaptive resolution for efficient multimodal reasoning in video understanding with Multimodal Large Language Models (MLLMs).
2. 💡 Previous Research and New Ideas:
   - Previous: various compression methods for MLLMs, including token dropping, merging, and uniform resizing.
   - New: input-side adaptation via a learned "Allocator" module that dynamically adjusts per-frame resolution based on content importance using a Beta policy, trained with joint RL (GRPO) and CAPO reward shaping.
3. ❓ Problem: MLLMs process video at uniform resolution, wasting computation on visually simple or redundant frames while potentially under-processing complex frames that require higher fidelity for accurate reasoning.
4. 🛠️ Methods:
   - Allocator module: a lightweight predictor using a Beta distribution policy to output per-frame scale factors.
   - Joint RL training with alternating optimization between the Allocator and the MLLM backbone.
   - CAPO reward shaping: combines task accuracy with a cost penalty and group normalization.
   - Temporal similarity regularizer (Lsim): prevents collapse to uniform allocation.
   - Resize instantiation: bilinear rescaling of frames based on predicted scales.
5. 📊 Results and Evaluation:
   - 83× reduction in backbone attention FLOPs at ρ=0.11 token retention.
   - Token retention ratio ρ ∈ [0.06, 0.16] across benchmarks.
   - 6-16× increase in effective temporal horizon at fixed compute.
   - Maintains or improves accuracy on VideoMME, LongVideoBench, MMVU, MMMU, and LVBench.
   - Allocator overhead under 3% of total inference FLOPs.
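The Allocator's Beta-policy sampling and per-frame resize can be sketched in NumPy. The parameterization of (α, β) from an importance feature, the scale floor s_min, and the nearest-neighbor stand-in for bilinear resizing are all assumptions made for illustration, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def allocator(frame_feats: np.ndarray, s_min: float = 0.25) -> np.ndarray:
    """Toy stand-in for the Allocator: map a per-frame importance
    feature to Beta(α, β) parameters, then sample a scale in [s_min, 1]."""
    importance = 1.0 / (1.0 + np.exp(-frame_feats))  # squash to (0, 1)
    # Hypothetical parameterization: important frames get a Beta
    # distribution concentrated near 1 (full resolution).
    alpha = 1.0 + 4.0 * importance
    beta = 1.0 + 4.0 * (1.0 - importance)
    u = rng.beta(alpha, beta)                # per-frame draw in (0, 1)
    return s_min + (1.0 - s_min) * u         # affine map into [s_min, 1]

def resize(frame: np.ndarray, scale: float) -> np.ndarray:
    # Index-based rescale (nearest neighbor for brevity); the paper uses
    # bilinear rescaling, e.g. torch.nn.functional.interpolate.
    h, w = frame.shape
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    ys = np.round(np.linspace(0, h - 1, nh)).astype(int)
    xs = np.round(np.linspace(0, w - 1, nw)).astype(int)
    return frame[ys][:, xs]

feats = np.array([-3.0, 0.0, 3.0])   # low / medium / high importance frames
scales = allocator(feats)
frames = [resize(np.ones((224, 224)), s) for s in scales]
```

Because vision-token count grows roughly quadratically with spatial resolution, even modest per-frame downscaling of unimportant frames compounds into the large FLOP reductions reported above.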

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

[Workflow diagram] Input video V and query q feed the Allocator (Beta policy π_θ), which drives adaptive rescaling (Resize) before the MLLM backbone π_φ produces the answer o. During training, rollouts and rewards drive a CAPO/GRPO policy-update loop; at inference, only the direct Allocator → Resize → backbone path is used.
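CAPO's reward shaping, task accuracy minus a cost penalty, normalized within each rollout group as in GRPO, can be sketched as follows. The linear penalty form and the λ value are assumptions, since the summary does not give the exact formula:

```python
import numpy as np

def capo_advantages(correct: np.ndarray, cost: np.ndarray,
                    lam: float = 0.5) -> np.ndarray:
    """Shaped reward = accuracy - λ·cost, then group-normalized
    (GRPO-style) so advantages are zero-mean within the rollout group."""
    reward = correct.astype(float) - lam * cost
    return (reward - reward.mean()) / (reward.std() + 1e-8)

# One group of 4 rollouts for the same query:
# (was the answer correct?, relative token cost of the chosen resolutions)
correct = np.array([1, 1, 0, 0])
cost = np.array([0.9, 0.3, 0.2, 0.8])
adv = capo_advantages(correct, cost)
```

A correct answer achieved at low cost (the second rollout) receives the largest advantage, while a wrong answer at high cost receives the smallest; the group normalization keeps the cost penalty from dominating and collapsing the policy to the minimum budget.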
Q1
1. What type of policy distribution does the Allocator module use to determine per-frame resolution scales?
Gaussian distribution policy
Beta distribution policy
Uniform distribution policy
Q2
2. What reward shaping technique does CAPO use to balance task accuracy and budget control while preventing collapse to minimum budgets?
Task accuracy with linear cost penalty only
Task accuracy combined with cost penalty and group normalization
Cost penalty with temporal smoothing only
Q3
3. At approximately what token retention ratio ρ does ResAdapt achieve its reported 83× reduction in backbone attention FLOPs?
ρ = 0.5
ρ = 0.01
ρ = 0.11

Today's Reading Tips

Start with the Medical AI Scientist paper—it presents a compelling end-to-end autonomous research framework with strong empirical results (86-93% success rates, competitive manuscript quality) that bridges AI automation with clinical domain expertise. The other two papers on image editing evaluation and adaptive video reasoning share methodological themes around building robust evaluation benchmarks and efficient system design, making them valuable complements after understanding this foundational medical AI work.