2025-09-09 Papers


Paper 1

Reverse-Engineered Reasoning for Open-Ended Generation

Published: 2025-09-07

Link: http://arxiv.org/pdf/2509.06160

1. 📘 Topic and Domain: The paper introduces a new paradigm called "Reverse-Engineered Reasoning" (REER) for improving open-ended text generation capabilities in large language models.
2. 💡 Previous Research and New Ideas: Previous research relied on reinforcement learning and instruction distillation for reasoning capabilities; this paper proposes a novel "backwards" approach of discovering reasoning processes from known good solutions.
3. ❓ Problem: The paper aims to solve the challenge of instilling deep reasoning capabilities in language models for open-ended, creative generation tasks where traditional methods fail due to lack of verifiable rewards or high costs.
4. 🛠️ Methods: The authors use a gradient-free local search algorithm that iteratively refines reasoning trajectories by minimizing the perplexity of the known good output given the trajectory, producing DeepWriting-20K, a dataset of 20,000 deep reasoning examples used to train their DeepWriter-8B model.
5. 📊 Results and Evaluation: DeepWriter-8B outperformed open-source baselines and achieved performance competitive with proprietary models like GPT-4o and Claude 3.5 on benchmarks like LongBench, HelloBench, and WritingBench.
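The REER search loop can be sketched as generic gradient-free local search over trajectories. The `perplexity` and `propose_edits` callables below are hypothetical stand-ins: in the paper, an LLM scores PPL(y|x,z) and proposes segment-wise edits, while here toy functions keep the sketch runnable.

```python
import random

def reer_local_search(x, y, z0, perplexity, propose_edits, n_iters=10, seed=0):
    """Gradient-free local search for a reasoning trajectory z (REER-style).

    Starting from an initial trajectory z0, repeatedly propose segment-wise
    edits and keep any candidate that lowers PPL(y | x, z), approximating
    the paper's objective z* = argmin_z PPL(y | x, z).
    """
    rng = random.Random(seed)
    z, best = z0, perplexity(x, z0, y)
    for _ in range(n_iters):
        for cand in propose_edits(z, rng):
            score = perplexity(x, cand, y)
            if score < best:  # lower perplexity = better trajectory
                z, best = cand, score
    return z, best

# Toy stand-ins: a real system would use an LLM both to score
# PPL(y | x, z) and to propose candidate edits to the trajectory.
TARGET_WORDS = {"plan", "outline", "draft", "revise"}

def toy_ppl(x, z, y):
    # Stand-in scorer: fraction of target "thinking step" words missing from z.
    return 1.0 - len(TARGET_WORDS & set(z.split())) / len(TARGET_WORDS)

def toy_edits(z, rng):
    # Candidate trajectories: append one more "thinking step" word.
    return [(z + " " + w).strip() for w in sorted(TARGET_WORDS)]

z_star, ppl = reer_local_search("prompt", "essay", "", toy_ppl, toy_edits)
```

The accept-if-better rule makes the search monotone in perplexity, which is why no gradient or verifiable reward is needed: the known good solution y itself anchors the objective.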

Figure: Reverse-Engineered Reasoning (REER) Workflow

- Data sourcing: public writing platforms, Project Gutenberg, public datasets.
- REER process: ① initialize trajectory z⁽⁰⁾ → ② segment-wise expansion and refinement → ③ evaluate PPL(y|x,z) and select; repeat in an iterative loop. Optimization objective: z* = arg min PPL(y|x,z); lower perplexity indicates a better reasoning trajectory.
- Context engineering: meta-structure for segment-wise edits, human-like thinking patterns ("Hmm...", "Wait..."), self-reflection mechanisms.
- DeepWriting-20K: 20,000 trajectory triples (x, z*, y) across 25 categories, with quality filtering (end-of-thinking filter, repetition filter, heuristic quality checks).
- Mixed training: 20K REER trajectories plus public datasets on a Qwen3-8B base, producing DeepWriter-8B, which is competitive with GPT-4o and Claude 3.5 on writing tasks.
- Evaluation: LongBench, HelloBench, WritingBench.
- Key innovation: working "backwards" from known good solutions. Instead of building reasoning "forwards" through trial-and-error or distillation, REER discovers the latent thinking process that could have produced high-quality outputs.
Q1. What is the key innovation of REER compared to traditional approaches?
- It uses reinforcement learning to generate better responses
- It works backwards from good solutions to discover reasoning processes
- It distills knowledge from larger proprietary models

Q2. How does REER evaluate the quality of a reasoning trajectory?
- By comparing it to human-written examples
- By measuring the perplexity score of the known good solution
- By using reinforcement learning rewards

Q3. What unique aspect of the DeepWriting-20K dataset creation process helps ensure high quality?
- It uses only human-written examples
- It relies on expensive proprietary models
- It injects human-like thinking patterns and self-reflection tokens

Paper 2

WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Published: 2025-09-08

Link: http://arxiv.org/pdf/2509.06501

1. 📘 Topic and Domain: The paper focuses on developing WebExplorer, a web agent training system in the domain of Large Language Models (LLMs) and information retrieval.
2. 💡 Previous Research and New Ideas: Previous research used graph-based and evolution-based approaches for web navigation data construction, while this paper introduces a novel model-based exploration and long-to-short query evolution approach.
3. ❓ Problem: The paper addresses the scarcity of challenging data for training web agents in complex information-seeking tasks.
4. 🛠️ Methods: The method combines model-based exploration to construct information spaces, iterative query evolution to increase difficulty, supervised fine-tuning for initialization, and reinforcement learning with GRPO algorithm for optimization.
5. 📊 Results and Evaluation: WebExplorer-8B achieved state-of-the-art performance across multiple benchmarks, including 15.7% on BrowseComp-en and 32.0% on BrowseComp-zh, outperforming larger models like WebSailor-72B despite having only 8B parameters.
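The long-to-short query evolution step can be sketched as iterative removal or obfuscation of salient details. In the paper an LLM performs the rewriting over 5 cycles; the fixed string substitutions and the example question below are illustrative stand-ins for that step.

```python
def evolve_query(query, substitutions, n_rounds=5):
    """Long-to-short query evolution (WebExplorer-style sketch).

    Each round removes or obfuscates one salient piece of the question,
    making the evolved query harder to answer by direct lookup while
    keeping the answer unchanged. Returns the full evolution trace.
    """
    evolved = [query]
    for old, new in substitutions[:n_rounds]:
        if old in evolved[-1]:
            evolved.append(evolved[-1].replace(old, new))
    return evolved

# Hypothetical example: each step strips one identifying detail.
steps = evolve_query(
    "Which 2015 paper by the DeepMind team introduced the DQN agent for Atari?",
    [
        ("2015 ", ""),                              # drop the year
        ("the DeepMind team", "a well-known lab"),  # obfuscate the author
        ("the DQN agent", "a value-based agent"),   # obfuscate the method
    ],
)
```

Keeping the whole trace (rather than only the final query) matches the reported quality analysis: one can measure how solver accuracy falls and tool-call turns rise as each salient detail is removed.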

Figure: WebExplorer Data Generation and Training Pipeline

- Phase 1 (data generation): model-based exploration starts from a seed entity, iteratively searches and browses to construct an information space, and generates initial QA pairs (high accuracy, ~7.9 turns). Iterative query evolution (remove salient information, strategic obfuscation, long-to-short rewriting, 5 iteration cycles) yields WebExplorer-QA (~40K samples, ~9.9 turns). Quality analysis: Claude-4-Sonnet accuracy drops from 86.6% to 67.1% and average turns rise from 7.9 to 9.9 after evolution.
- Phase 2 (training): Qwen3-8B base model; supervised fine-tuning (~13K samples, 4 epochs); reinforcement learning with GRPO (~12K samples); progressive scaling (64K→96K→128K context, 50→75→100 turns) with format plus correctness rewards; result: WebExplorer-8B (128K context, 100 turns).
- Phase 3 (evaluation): information seeking: BrowseComp-en 15.7%, BrowseComp-zh 32.0%, WebWalkerQA 62.7%, FRAMES 75.7%. Performance: outperforms WebSailor-72B, SOTA at the 8B scale, with an average of 16+ tool calls and 40K+ token trajectories. Generalization: HLE 17.3%, GAIA 50.0%, XBench-DS 53.7%, showing strong cross-domain transfer.
- Key innovations: model-based exploration, long-to-short evolution, strategic obfuscation, long-horizon reasoning.
Q1. What makes WebExplorer's query evolution approach different from previous methods?
- It adds more information to make queries longer
- It removes salient information to increase difficulty
- It translates queries into multiple languages

Q2. What impressive achievement did WebExplorer-8B demonstrate despite its smaller size?
- It processed queries faster than all other models
- It achieved perfect accuracy on all benchmarks
- It outperformed WebSailor-72B despite being 9x smaller

Q3. During reinforcement learning training, what significant change was observed in WebExplorer's behavior?
- Average number of tool calls increased from 11 to over 16
- Response time decreased by 50%
- Memory usage was reduced by 75%

Paper 3

Does DINOv3 Set a New Medical Vision Standard?

Published: 2025-09-08

Link: http://arxiv.org/pdf/2509.06467

1. 📘 Topic and Domain: Evaluating DINOv3, a self-supervised vision transformer trained on natural images, for medical imaging tasks including 2D/3D classification and segmentation.
2. 💡 Previous Research and New Ideas: Previous work spans the DINO series and medical vision models like BiomedCLIP; this paper proposes using the natural-image-trained DINOv3 as a universal encoder for medical imaging without domain-specific pre-training.
3. ❓ Problem: Investigating whether DINOv3's visual features trained on natural images can effectively transfer to specialized medical imaging tasks without medical domain pre-training.
4. 🛠️ Methods: Conducted comprehensive benchmarking across multiple medical imaging tasks using linear probing, k-NN evaluation, and multiple instance learning, testing different model sizes (DINOv3-S/B/L) and input resolutions.
5. 📊 Results and Evaluation: DINOv3 showed strong performance on X-ray and CT tasks but struggled with specialized domains like pathology slides and PET scans, with inconsistent scaling benefits across different medical tasks and modalities.
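The linear-probing protocol used for the classification benchmarks can be sketched as fitting only a linear head on frozen encoder features. The closed-form ridge head and the synthetic features below are illustrative stand-ins, assuming nothing about the actual DINOv3 feature dimensions.

```python
import numpy as np

def fit_linear_probe(features, labels, l2=1e-3):
    """Fit a ridge-regression linear probe on frozen encoder features.

    Mirrors the evaluation protocol at a sketch level: the backbone
    (e.g. DINOv3) stays frozen, and only this linear head is trained
    on its output features.
    """
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias column
    Y = np.eye(int(labels.max()) + 1)[labels]               # one-hot targets
    # Closed-form ridge solution: (X^T X + l2*I)^-1 X^T Y
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(W, features, labels):
    X = np.hstack([features, np.ones((len(features), 1))])
    return float((np.argmax(X @ W, axis=1) == labels).mean())

# Synthetic "frozen features": the class is a threshold on feature 0,
# so a linear probe should recover it almost perfectly.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
labels = (feats[:, 0] > 0).astype(int)
W = fit_linear_probe(feats, labels)
acc = probe_accuracy(W, feats, labels)
```

Because only the head is trained, probe accuracy directly measures how linearly separable the medical classes are in the frozen feature space, which is exactly the transfer question the paper investigates.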

Figure: DINOv3 Medical Vision Benchmark Workflow

- Medical imaging datasets: 2D classification (NIH-14, RSNA-Pneumonia, Camelyon16/17, BCNB); 3D classification (CT-RATE); 3D segmentation (MSD, CREMI, AC3/4, AutoPET-II, HECKTOR).
- Model variants: DINOv3-S (22M), DINOv3-B (86M), DINOv3-L (304M).
- Data preprocessing: grayscale to 3-channel conversion, slice-wise extraction for 3D volumes, WSI patch tiling.
- Task adaptation: classification via linear probing and k-NN (CT-RATE); multiple instance learning with attention-based aggregation (ABMIL) for WSIs; 3D segmentation via slice-wise feature extraction with a 3D decoder and segmentation head.
- Evaluation metrics: AUC, accuracy, F1 score; Dice score, HD95; VOI and ARAND for EM.
- Key findings: strong performance on X-ray and CT classification, outperforming medical-specific models; poor performance on WSI, EM, and PET domains due to large domain shift; scaling behavior is inconsistent in the medical setting and task-dependent.
Q1. What was the most surprising finding about DINOv3's performance scaling in medical imaging tasks?
- Larger models always performed better than smaller ones
- Performance did not reliably increase with larger models or higher resolutions
- The model only worked with low resolution medical images

Q2. In which type of medical imaging task did DINOv3 perform the worst?
- Chest X-ray classification
- CT scan analysis
- PET scan tumor segmentation

Q3. What makes DINOv3 particularly interesting as a medical imaging model?
- It was pre-trained specifically on medical images
- It outperformed medical-specific models despite being trained only on natural images
- It was designed exclusively for 3D medical image analysis