2026-03-31 Papers

Paper 1

Towards a Medical AI Scientist

Published: 2026-03-30

Link: http://arxiv.org/pdf/2603.28589

1. 📘 Topic and Domain: The paper introduces Medical AI Scientist, an autonomous framework for end-to-end clinical medical AI research automation spanning hypothesis generation, experimental validation, and manuscript drafting.
2. 💡 Previous Research and New Ideas: Based on existing AI Scientist systems (AI Scientist-v2, AI-Researcher, Agent Laboratory) and medical AI applications, the paper proposes a novel clinician-engineer co-reasoning mechanism and three operational modes (Reproduction, Innovation, Exploration) tailored specifically for clinical medicine.
3. ❓ Problem: Existing AI Scientists are domain-agnostic, lacking mechanisms to ground hypotheses in medical evidence, handle heterogeneous clinical data formats, or ensure ethical compliance, making them unsuitable for clinical autonomous research.
4. 🛠️ Methods: The framework comprises three components: an Idea Proposer with clinician-engineer co-reasoning for evidence-grounded hypothesis generation, an Experimental Executor orchestrating domain-specific medical toolboxes in Dockerized environments, and a hierarchical Manuscript Composer enforcing structured medical writing with ethical review.
5. 📊 Results and Evaluation: Across 171 evaluation cases (19 tasks, 6 modalities), the system outperforms GPT-5 and Gemini-2.5-Pro in idea quality across six dimensions; achieves 86-93% experimental success rates; and generates manuscripts scoring 4.60±0.56 (Stanford Agentic Reviewer), competitive with MICCAI/ISBI/BIBM publications under double-blind human evaluation, with one manuscript accepted at ICAIS 2025.
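The three-component loop described above can be sketched schematically. Everything below (class and function names, stub logic, the placeholder AUC metric) is hypothetical scaffolding for illustration, not the paper's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical data carrier; the paper's structured task representation
# is far richer than this.
@dataclass
class ResearchPlan:
    hypothesis: str
    evidence: list   # literature citations grounding the hypothesis
    mode: str        # "reproduction" | "innovation" | "exploration"

def propose(task: str, mode: str) -> ResearchPlan:
    # Idea Proposer: clinician and engineer "roles" co-reason so the
    # hypothesis is both clinically relevant and technically feasible.
    clinician_view = f"clinical utility of {task}"
    engineer_view = "a reproducible, Dockerized training pipeline"
    return ResearchPlan(
        hypothesis=f"{clinician_view}, validated via {engineer_view}",
        evidence=["placeholder-citation"],
        mode=mode,
    )

def execute(plan: ResearchPlan) -> dict:
    # Experimental Executor: assemble the codebase, run training and
    # validation, judge consistency; reduced here to a results record.
    return {"plan": plan, "metrics": {"auc": 0.91}}

def compose(results: dict) -> str:
    # Manuscript Composer: evidence-based writing with an ethics check.
    plan = results["plan"]
    return f"[{plan.mode}] {plan.hypothesis} | AUC={results['metrics']['auc']}"

draft = compose(execute(propose("tumor segmentation", "exploration")))
```

The point of the sketch is the dataflow: each stage consumes the previous stage's structured output, which is what makes the reflect-and-refine cycle possible.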

Towards a Medical AI Scientist

[Workflow diagram] Three research modes: Paper-based Reproduction, Literature-inspired Innovation, and Task-driven Exploration, each starting from task instructions and a dataset (± reference papers). The Idea Proposer (Analyzer, Explorer, Preparer & Surveyor, co-reasoning Generator, Assessor) formalizes the medical task, grounds hypotheses in synthesized literature, and outputs a research plan. The Experimental Executor (Investigator, Planner, Executor, Judger, Analyst) assembles domain-specific toolboxes, runs training and validation pipelines in a Dockerized environment, and applies a reflect-and-refine cycle with corrective feedback. The Manuscript Composer (Content Generator, Ethics Reviewer, Narrative Enhancer, Reference Resolver) produces a publication-ready manuscript. Evaluation uses Med-AI Bench: 171 cases spanning 19 clinical tasks and 6 data modalities (image, video, EHR, text, signal, multimodal).
Q1
1. Which of the following is NOT one of the three research modes supported by the Medical AI Scientist framework?
Paper-based Reproduction
Hypothesis-driven Verification
Task-driven Exploration
Q2
2. What is the core innovation in the Idea Proposer that helps reduce hallucinations during hypothesis generation?
Multi-agent debate system
Clinician-engineer co-reasoning mechanism
Bayesian optimization framework
Q3
3. According to the paper, what success rate did the Medical AI Scientist achieve in the Task-driven Exploration mode during code execution experiments?
86%
93%
91%

Paper 2

GEditBench v2: A Human-Aligned Benchmark for General Image Editing

Published: 2026-03-30

Link: http://arxiv.org/pdf/2603.28547

1. 📘 Topic and Domain: The paper focuses on the evaluation of general image editing models, specifically addressing visual consistency assessment, and belongs to the domain of computer vision and multimodal AI.
2. 💡 Previous Research and New Ideas: The paper builds on existing VLM-as-a-Judge evaluation paradigms but introduces two novel contributions: (1) GEditBench v2, a benchmark with 23 diverse tasks including an open-set category for real-world unconstrained editing, and (2) PVC-Judge, an open-source pairwise assessment model for visual consistency trained via two region-decoupled preference data synthesis pipelines.
3. ❓ Problem: The paper addresses the limitation that existing image editing benchmarks suffer from narrow task coverage and that current evaluation metrics fail to adequately capture visual consistency (preservation of identity, structure, and semantic coherence between edited and original images).
4. 🛠️ Methods: The authors develop GEditBench v2 with 1,200 real-world queries spanning 23 tasks, create PVC-Judge by fine-tuning Qwen3-VL-8B-Instruct with ~128k preference pairs constructed using object-centric, human-centric, and VLM-as-a-Judge pipelines, and establish VCReward-Bench with 3,506 expert-annotated preference pairs for meta-evaluation.
5. 📊 Results and Evaluation: Experiments show that PVC-Judge achieves state-of-the-art performance among open-source evaluation models, with an average accuracy of 81.82%, even surpassing GPT-5.1 (76.89%); evaluating 16 frontier editing models on GEditBench v2 yields Overall Elo scores with a strong Spearman rank correlation to human-annotated Arena rankings (ρ=0.929), confirming that the evaluation system is closely aligned with human preferences.
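The object-centric pipeline's "Z-score + Pareto filtering" step, which selects preference-pair candidates that are good on every region-specific metric at once, can be illustrated with a small NumPy sketch; the scores and the dominance convention (higher is better) are invented for illustration:

```python
import numpy as np

def pareto_front(scores: np.ndarray) -> np.ndarray:
    """Boolean mask of Pareto-optimal rows (higher is better on every
    column). scores has shape (n_candidates, n_metrics)."""
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(scores, i, axis=0)
        # Row i is dominated if some other row is >= on all metrics
        # and strictly > on at least one.
        dominated = np.any(
            np.all(others >= scores[i], axis=1)
            & np.any(others > scores[i], axis=1)
        )
        keep[i] = not dominated
    return keep

# Toy region-specific metric scores for 4 candidate edits (2 metrics).
raw = np.array([[0.9, 0.2],
                [0.8, 0.8],
                [0.3, 0.9],
                [0.2, 0.1]])
# Z-score each metric so no single metric's scale dominates comparisons.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
mask = pareto_front(z)
```

Here the last candidate is dominated (worse than the second on both metrics) and is filtered out, while the three Pareto-optimal candidates survive; Z-scoring is monotone per column, so it changes scales without changing dominance relations.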

GEditBench v2: A Human-Aligned Benchmark for General Image Editing

[Workflow diagram] GEditBench v2 construction: 23 task categories (12 local, 6 global, 3 reference, 1 hybrid) plus an open-set of 100 real-world queries collected from Reddit and X with expert filtering and privacy protection; public images plus Nano Banana Pro generation yield 1,200 testing examples. PVC-Judge development: (1) candidate image generation — prompts curated from Pico-Banana-400K, Nano-Consistency-150K, and UnicEdit-10M, K-Center greedy selection (N=1,500 samples per task via Qwen3-VL-Embedding), and ~180k images from 7 models (BAGEL, Kontext, Step1X-Edit, Qwen-Image-Edit series); (2) ~128k preference pairs built with an object-centric pipeline (region decoupling, region-specific metrics, Z-score + Pareto filtering), a human-centric pipeline (face ID, body, hair; expert models such as ArcFace; dynamic attribute exclusion), and VLM-as-a-Judge with Gemini 3 Pro for global tasks (background change, style/tone transfer, enhancement; P=6 in-group pairs per sample); (3) training — LoRA (r=64, α=128) fine-tuning of Qwen3-VL-8B-Instruct with AdamW, LR=2e-6, 3 epochs, batch size 16, on 8×L40S GPUs. VCReward-Bench: 3,506 expert-annotated preference pairs over 21 predefined tasks and 8 editing models (including proprietary), with strict Pareto filtering for visual-consistency preference. Evaluation pipeline: instruction following and visual quality judged pairwise by GPT-4o, visual consistency by PVC-Judge; a Bradley-Terry model converts pairwise outcomes to Elo ratings, with 95% CIs from 1,000 bootstrap iterations. Key results: PVC-Judge is state-of-the-art among open-source judges (81.82% average accuracy vs 76.89% for GPT-5.1), outperforming EditScore, EditReward, and Qwen3-VL-8B; among 16 models, the top three are Nano Banana Pro, GPT-Image-1.5, and Seedream 4.5, with FLUX.2 [klein] 9B (4-step distilled) the best open-source; Elo scores correlate with Arena rankings at Spearman ρ=0.929; pairwise evaluation aligns with humans better than pointwise, and there is a trade-off between instruction following and visual consistency.
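The ranking step (a Bradley-Terry model mapped to Elo-style ratings) can be sketched with the classic minorization-maximization update; the toy win matrix is invented, and the paper's 95% confidence intervals from 1,000 bootstrap iterations are omitted here for brevity:

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 500) -> np.ndarray:
    """Fit Bradley-Terry strengths p_i from wins[i, j] = number of
    pairwise wins of model i over model j, via the MM update
    p_i <- W_i / sum_j n_ij / (p_i + p_j)."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T            # n_ij: total comparisons of i vs j
    total_wins = wins.sum(axis=1)    # W_i
    for _ in range(iters):
        new_p = np.empty(n)
        for i in range(n):
            denom = (games[i] / (p[i] + p)).sum()
            new_p[i] = total_wins[i] / denom
        p = new_p / new_p.sum()      # normalize to fix the arbitrary scale
    return p

def to_elo(p: np.ndarray, base: float = 1000.0, scale: float = 400.0) -> np.ndarray:
    # Elo-style mapping: a 400-point gap corresponds to 10:1 odds.
    return base + scale * np.log10(p / p.mean())

# Toy pairwise win counts among three editing models.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
elo = to_elo(fit_bradley_terry(wins))
```

The model that wins most of its pairwise comparisons ends up with the highest Elo rating, and because only rating *differences* are identified, the normalization and base value are free choices.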
Q1
1. What is a unique feature of GEditBench v2 compared to previous image editing benchmarks?
It uses pointwise scoring instead of pairwise comparison
It includes an open-set category for unconstrained real-world editing instructions
It only evaluates closed-set predefined tasks without any new categories
Q2
2. How was PVC-Judge trained to achieve human-aligned pairwise evaluation for visual consistency?
Using a single unified metric across all editing tasks
Via region-decoupled preference data synthesis pipelines (object-centric and human-centric)
By collecting human ratings exclusively through crowdsourcing platforms
Q3
3. According to experiments on VCReward-Bench, what did PVC-Judge achieve compared to GPT-5.1?
It performed significantly worse with an average accuracy of 65%
It achieved state-of-the-art performance among open-source models, even surpassing GPT-5.1 with 81.82 vs 76.89 average accuracy
It only worked well for simple local editing tasks like subject removal

Paper 3

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Published: 2026-03-30

Link: http://arxiv.org/pdf/2603.28610

1. 📘 Topic and Domain: Adaptive resolution for efficient multimodal reasoning in video understanding with Multimodal Large Language Models (MLLMs).
2. 💡 Previous Research and New Ideas:
   - Previous: various compression methods for MLLMs, including token dropping, merging, and uniform resizing.
   - New: input-side adaptation via a learned "Allocator" module that dynamically adjusts per-frame resolution based on content importance using a Beta policy, trained with joint RL (GRPO) and CAPO reward shaping.
3. ❓ Problem: MLLMs process video at uniform resolution, wasting computation on visually simple or redundant frames while potentially under-processing complex frames that require higher fidelity for accurate reasoning.
4. 🛠️ Methods:
   - Allocator module: a lightweight predictor using a Beta distribution policy to output per-frame scale factors.
   - Joint RL training with alternating optimization between the Allocator and the MLLM backbone.
   - CAPO reward shaping: combines task accuracy with a cost penalty and group normalization.
   - Temporal similarity regularizer (Lsim): prevents collapse to uniform allocation.
   - Resize instantiation: bilinear rescaling of frames based on predicted scales.
5. 📊 Results and Evaluation:
   - 83× reduction in backbone attention FLOPs at ρ=0.11 token retention.
   - Token retention ratio ρ ∈ [0.06, 0.16] across benchmarks.
   - 6-16× increase in effective temporal horizon at fixed compute.
   - Maintains or improves accuracy on VideoMME, LongVideoBench, MMVU, MMMU, and LVBench.
   - Allocator overhead under 3% of total inference FLOPs.
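The Allocator's Beta-policy sampling and per-frame resize can be sketched in NumPy. The parameterization of (α, β) from an importance feature, the scale floor s_min, and the nearest-neighbor stand-in for bilinear resizing are all assumptions made for illustration, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def allocator(frame_feats: np.ndarray, s_min: float = 0.25) -> np.ndarray:
    """Toy stand-in for the Allocator: map a per-frame importance
    feature to Beta(α, β) parameters, then sample a scale in [s_min, 1]."""
    importance = 1.0 / (1.0 + np.exp(-frame_feats))  # squash to (0, 1)
    # Hypothetical parameterization: important frames get a Beta
    # distribution concentrated near 1 (full resolution).
    alpha = 1.0 + 4.0 * importance
    beta = 1.0 + 4.0 * (1.0 - importance)
    u = rng.beta(alpha, beta)                # per-frame draw in (0, 1)
    return s_min + (1.0 - s_min) * u         # affine map into [s_min, 1]

def resize(frame: np.ndarray, scale: float) -> np.ndarray:
    # Index-based rescale (nearest neighbor for brevity); the paper uses
    # bilinear rescaling, e.g. torch.nn.functional.interpolate.
    h, w = frame.shape
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    ys = np.round(np.linspace(0, h - 1, nh)).astype(int)
    xs = np.round(np.linspace(0, w - 1, nw)).astype(int)
    return frame[ys][:, xs]

feats = np.array([-3.0, 0.0, 3.0])   # low / medium / high importance frames
scales = allocator(feats)
frames = [resize(np.ones((224, 224)), s) for s in scales]
```

Because vision-token count grows roughly quadratically with spatial resolution, even modest per-frame downscaling of unimportant frames compounds into the large FLOP reductions reported above.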

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

[Workflow diagram] Input video V and query q feed the Allocator (Beta policy π_θ), which drives adaptive rescaling (Resize) before the MLLM backbone π_φ produces the answer o. During training, rollouts and rewards drive a CAPO/GRPO policy-update loop; at inference, only the direct Allocator → Resize → backbone path is used.
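CAPO's reward shaping, task accuracy minus a cost penalty, normalized within each rollout group as in GRPO, can be sketched as follows. The linear penalty form and the λ value are assumptions, since the summary does not give the exact formula:

```python
import numpy as np

def capo_advantages(correct: np.ndarray, cost: np.ndarray,
                    lam: float = 0.5) -> np.ndarray:
    """Shaped reward = accuracy - λ·cost, then group-normalized
    (GRPO-style) so advantages are zero-mean within the rollout group."""
    reward = correct.astype(float) - lam * cost
    return (reward - reward.mean()) / (reward.std() + 1e-8)

# One group of 4 rollouts for the same query:
# (was the answer correct?, relative token cost of the chosen resolutions)
correct = np.array([1, 1, 0, 0])
cost = np.array([0.9, 0.3, 0.2, 0.8])
adv = capo_advantages(correct, cost)
```

A correct answer achieved at low cost (the second rollout) receives the largest advantage, while a wrong answer at high cost receives the smallest; the group normalization keeps the cost penalty from dominating and collapsing the policy to the minimum budget.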
Q1
1. What type of policy distribution does the Allocator module use to determine per-frame resolution scales?
Gaussian distribution policy
Beta distribution policy
Uniform distribution policy
Q2
2. What reward shaping technique does CAPO use to balance task accuracy and budget control while preventing collapse to minimum budgets?
Task accuracy with linear cost penalty only
Task accuracy combined with cost penalty and group normalization
Cost penalty with temporal smoothing only
Q3
3. At approximately what token retention ratio ρ does ResAdapt achieve its reported 83× reduction in backbone attention FLOPs?
ρ = 0.5
ρ = 0.01
ρ = 0.11

Today's Reading Tips

Start with the Medical AI Scientist paper—it presents a compelling end-to-end autonomous research framework with strong empirical results (86-93% success rates, competitive manuscript quality) that bridges AI automation with clinical domain expertise. The other two papers on image editing evaluation and adaptive video reasoning share methodological themes around building robust evaluation benchmarks and efficient system design, making them valuable complements after understanding this foundational medical AI work.