2025-09-08 Papers


Paper 1

Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Published: 2025-09-04

Link: http://arxiv.org/pdf/2509.04292

1. 📘 Topic and Domain: The paper presents Inverse IFEval, a benchmark for evaluating large language models' ability to follow counter-intuitive instructions that conflict with their training conventions.
2. 💡 Previous Research and New Ideas: The paper builds on existing evaluation benchmarks such as MMLU and IFEval, but introduces a novel focus on testing models' ability to override training-induced biases and follow adversarial instructions.
3. ❓ Problem: The paper addresses LLMs' cognitive inertia: their tendency to stubbornly follow standardized patterns learned during training rather than adapting to unconventional or counter-intuitive instructions.
4. 🛠️ Methods: The authors constructed a dataset of 1012 Chinese and English questions across 23 domains using a multi-stage human-in-the-loop pipeline, incorporating eight types of counter-intuitive instructions and evaluating models using an optimized LLM-as-a-Judge framework.
5. 📊 Results and Evaluation: The results showed that even advanced LLMs struggle with counter-intuitive instructions, with o3-high performing best but still showing limitations, highlighting the need for improved model adaptability to unconventional contexts.
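The optimized LLM-as-a-Judge framework in step 4 can be sketched as a simple prompt-and-verdict loop. This is a minimal illustration, not the paper's actual judge prompts or models: `judge_fn`, `JUDGE_TEMPLATE`, and the rubric text are all hypothetical placeholders.

```python
# Minimal sketch of an LLM-as-a-Judge check for counter-intuitive
# instructions. `judge_fn` stands in for a call to a dedicated judge
# model; the template and rubric are illustrative, not the paper's.

JUDGE_TEMPLATE = """You are grading whether a model FOLLOWED a
counter-intuitive instruction, even if the result looks "wrong"
by normal conventions.

Instruction: {instruction}
Model answer: {answer}
Rubric: {rubric}

Reply with exactly PASS or FAIL."""


def judge_answer(instruction, answer, rubric, judge_fn):
    """Format the judge prompt and parse a binary verdict."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction, answer=answer, rubric=rubric
    )
    verdict = judge_fn(prompt).strip().upper()
    return verdict == "PASS"


def score_model(examples, judge_fn):
    """Fraction of counter-intuitive instructions the model followed."""
    passed = sum(
        judge_answer(ex["instruction"], ex["answer"], ex["rubric"], judge_fn)
        for ex in examples
    )
    return passed / len(examples)


if __name__ == "__main__":
    # Stub judge for a hypothetical "Code without Comments" rubric:
    # fail any answer that contains a Python comment character.
    def stub_judge(prompt):
        answer = prompt.split("Model answer:")[1].split("Rubric:")[0]
        return "FAIL" if "#" in answer else "PASS"

    examples = [
        {"instruction": "Write Python code WITHOUT any comments.",
         "answer": "def add(a, b):\n    return a + b",
         "rubric": "PASS only if the code contains zero comments."},
        {"instruction": "Write Python code WITHOUT any comments.",
         "answer": "# adds two numbers\ndef add(a, b):\n    return a + b",
         "rubric": "PASS only if the code contains zero comments."},
    ]
    print(score_model(examples, stub_judge))  # 0.5
```

A real judge would replace `stub_judge` with an API call to the dedicated judge model; the binary PASS/FAIL parse keeps the metric simple to aggregate.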

Methodology flow chart (key stages):
- Problem identification: cognitive inertia in LLMs and training-convention bias, surfaced by analyzing the SFT paradigm and reversing observed training patterns.
- Benchmark design: eight adversarial instruction types targeting counter-cognitive ability.
- Data construction pipeline: seed data manually crafted by experts with cross-validation and inter-rater agreement; large-scale LLM-based generation with prompt engineering across 23 domains; automatic filtering on length constraints and semantic similarity; human quality assurance (type consistency, clarity checks, rubric calibration); final dataset of 1012 Chinese and English questions with metadata annotation.
- Evaluation framework: LLM-as-a-Judge with a dedicated judge model (98% accuracy), optimized via template structure and system-prompt enhancement.
- Experimental analysis: impact of the thinking mechanism, comparison across instruction types, ranking variations versus IFEval, and test-time scaling (Best-of-N).
- Key findings: LLMs show cognitive inertia on counter-intuitive instructions; thinking mechanisms improve performance; fine-tuned models struggle more; performance varies across instruction types.
Q1
1. What is the main challenge or limitation of LLMs that Inverse IFEval aims to evaluate?
Models' inability to understand complex instructions
Models' tendency to stick to training conventions rather than following new instructions
Models' poor performance in multiple languages
Q2
2. Which of the following is NOT one of the eight types of counter-intuitive instructions tested in Inverse IFEval?
Question Correction and Code without Comments
Deliberately Incorrect Answers and Counterfactual Answering
Emotional Response Generation and Sentiment Analysis
Q3
3. What unique aspect of the evaluation methodology makes Inverse IFEval different from traditional benchmarks?
It only tests models in English language
It focuses on models' speed and efficiency
It deliberately creates scenarios that conflict with training patterns

Paper 2

A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Published: 2025-08-25

Link: http://arxiv.org/pdf/2508.18106

1. 📘 Topic and Domain: A benchmark for evaluating security in AI-generated code at the repository level, focusing on software engineering and code security.
2. 💡 Previous Research and New Ideas: Based on existing code security benchmarks that focus on isolated snippets, this paper proposes a new benchmark that evaluates security at the repository level while maintaining full project context and dependencies.
3. ❓ Problem: Current benchmarks lack repository-level context, have unstable evaluation methods, and fail to connect input context quality with output security, making it difficult to properly assess AI-generated code security.
4. 🛠️ Methods: Created A.S.E benchmark using real-world repositories with documented CVEs, implemented a containerized evaluation framework with expert-defined rules, and evaluated models across security, build quality, and generation stability metrics.
5. 📊 Results and Evaluation: Claude-3.7-Sonnet achieved best overall performance (52.79 points), open-source model Qwen3-235B-A22B-Instruct achieved best security score, and "fast-thinking" configurations consistently outperformed "slow-thinking" strategies for security patching.

Benchmark construction and evaluation workflow (key stages):
- Data sourcing: 100k+ CVE entries linked to GitHub repositories; algorithm-guided screening by CWE and language; expert-guided curation with manual review and SAST; dataset expansion via semantic and structural mutation (funnel: 5k+ → 199 → 40 → 120 repository instances).
- Vulnerability categories (CWE): SQL injection CWE-89 (29.2%), path traversal CWE-22 (26.7%), XSS CWE-79 (25.0%), command injection CWE-78 (19.2%).
- Programming languages: PHP 50.0%, Python 19.2%, Go 14.2%, JavaScript 14.2%, Java 2.5%.
- Code generation phase: repository context retrieved with BM25, vulnerable code masked, then regenerated by the LLM.
- Security evaluation framework: Overall Score = 0.6 × Security + 0.3 × Quality + 0.1 × Stability, scored with CodeQL and Joern SAST under expert-defined rules inside Docker containers.
- Key technologies: Docker, CodeQL, Joern, BM25, git apply, SAST, expert rules, mutation.
- Key findings: Claude-3.7-Sonnet best overall (63.01; security 46.72); Qwen3-235B-A22B best security (48.03), leading among open-source models; "fast-thinking" outperforms "slow-thinking" on security tasks. 26 SOTA LLMs evaluated over 120 repository instances, 3 trials per model.
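The weighted scoring above is easy to reproduce. Only the 0.6/0.3/0.1 weights come from the paper; the component scores passed in below are made-up inputs for illustration.

```python
# A.S.E overall score: Overall = 0.6 * Security + 0.3 * Quality
# + 0.1 * Stability. The weights are from the benchmark; the example
# inputs are illustrative, not any model's actual sub-scores.

WEIGHTS = {"security": 0.6, "quality": 0.3, "stability": 0.1}


def overall_score(security: float, quality: float, stability: float) -> float:
    """Weighted A.S.E overall score; all inputs on a 0-100 scale."""
    return (WEIGHTS["security"] * security
            + WEIGHTS["quality"] * quality
            + WEIGHTS["stability"] * stability)


if __name__ == "__main__":
    print(overall_score(security=50.0, quality=70.0, stability=90.0))  # 60.0
```

The 60% weight on security explains why a model can lead the overall ranking without having the single best security score, as happens with Claude-3.7-Sonnet versus Qwen3-235B-A22B.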
Q1
1. What surprising finding did the study reveal about model reasoning strategies?
Complex 'slow-thinking' strategies performed best for security patches
Fast and slow thinking strategies performed equally well
Simple 'fast-thinking' strategies outperformed complex reasoning approaches
Q2
2. Which vulnerability type presented the greatest challenge for LLMs according to the benchmark results?
SQL Injection
Path Traversal
Cross-Site Scripting (XSS)
Q3
3. What key innovation differentiates A.S.E from previous code security benchmarks?
It focuses only on snippet-level code evaluation
It uses LLMs as the primary security evaluators
It preserves full repository context and cross-file dependencies

Paper 3

Transition Models: Rethinking the Generative Learning Objective

Published: 2025-09-04

Link: http://arxiv.org/pdf/2509.04394

1. 📘 Topic and Domain: Generative modeling with a focus on improving diffusion models for efficient and high-quality image generation.
2. 💡 Previous Research and New Ideas: Based on diffusion models, consistency models, and flow-matching approaches; proposes a novel Transition Models (TiM) framework that learns state transitions across arbitrary time intervals.
3. ❓ Problem: Addresses the trade-off between generation quality and computational efficiency in existing generative models, where models either require many steps for high quality or sacrifice quality for speed.
4. 🛠️ Methods: Introduces a continuous-time dynamics equation for state transitions, implements decoupled time embeddings and interval-aware attention, and uses an efficient Differential Derivation Equation (DDE) for training.
5. 📊 Results and Evaluation: TiM (865M parameters) outperforms larger models like SD3.5 (8B) and FLUX.1 (12B) across different metrics, achieving superior performance with both single-step and multi-step generation while scaling effectively to 4096×4096 resolution.
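The DDE idea in step 4 replaces an exact derivative with a finite difference of two forward passes. Here is a toy numerical sketch under stated assumptions: `f` is a simple analytic stand-in for the network f_θ, not the paper's training code.

```python
import math

# Toy illustration of the DDE idea: approximate the total time
# derivative df/dt along a trajectory with a finite difference of two
# forward passes, instead of an exact Jacobian-vector product (JVP).
# `f` is a placeholder for the network f_theta.


def f(x: float, t: float) -> float:
    # Placeholder "network": any smooth function of state and time.
    return math.sin(t) * x


def df_dt_finite_difference(x, t, dx_dt, eps=1e-4):
    """Finite-difference total derivative: two evaluations of f replace
    an exact JVP (the paper reports roughly 2x lower cost and FSDP
    compatibility for this style of derivative)."""
    return (f(x + eps * dx_dt, t + eps) - f(x, t)) / eps


def df_dt_exact(x, t, dx_dt):
    # Exact total derivative for the toy f, for comparison:
    # d/dt [sin(t) * x(t)] = cos(t) * x + sin(t) * dx/dt
    return math.cos(t) * x + math.sin(t) * dx_dt


if __name__ == "__main__":
    x, t, dx_dt = 0.7, 0.3, -1.2
    approx = df_dt_finite_difference(x, t, dx_dt)
    exact = df_dt_exact(x, t, dx_dt)
    print(abs(approx - exact) < 1e-3)  # True
```

The appeal is operational: a finite difference needs only ordinary forward passes, which is what makes it cheap and friendly to sharded training setups such as FSDP.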

Methodology flow (key components):
- Problem analysis: diffusion models give high quality but need many steps; few-step models are fast but hit a quality ceiling, so performance saturates.
- Core innovation: a state-transition identity, d/dt(B_{t,r} · h(t)) = 0, enabling transitions over arbitrary intervals Δt and learning the solution manifold instead of local dynamics.
- Mathematical framework: state transition x_r = A_{t,r}·x_t + B_{t,r}·f_θ(x_t, t, r); training target f̂ = α̂_t·x + σ̂_t·ε plus correction terms.
- Scalability (DDE): df/dt approximated by a finite difference, about 2× faster than a JVP and compatible with FSDP.
- Stability: loss weighting w(t,r) = (σ_data + tan(t) − tan(r))^(−1/2), which prioritizes short intervals and reduces gradient variance.
- Architecture: decoupled time embedding E_{t,Δt} = φ_t(t) + φ_Δt(Δt), plus interval-aware attention that adds E_Δt projections to Q, K, and V.
- Training: from scratch, 865M parameters, native resolution, 30 days on 16 A100s.
- Sampling: arbitrary-step sampling from 1 to 128 NFEs, with monotonic quality improvement and insensitivity to the schedule.
- Results: outperforms SD3.5 and FLUX.1; GenEval 0.67 → 0.83; scales to 4096×4096 resolution.
- Final training objective: E[w(t,r) · d(f_θ(x_t, t, r), f̂)], unifying few-step efficiency with many-step quality in a single model.
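The interval-dependent loss weight w(t,r) from the flow above can be evaluated directly to see that it up-weights short intervals. The σ_data value below is an illustrative guess, not taken from the paper.

```python
import math

# Loss weight from the TiM methodology:
# w(t, r) = (sigma_data + tan(t) - tan(r)) ** (-1/2),
# which assigns larger weights to short transition intervals
# (t close to r). sigma_data = 0.5 is an illustrative value only.

SIGMA_DATA = 0.5


def loss_weight(t: float, r: float) -> float:
    """Interval-aware loss weight; assumes r <= t so tan(t) >= tan(r)
    and the base of the power stays positive."""
    return (SIGMA_DATA + math.tan(t) - math.tan(r)) ** -0.5


if __name__ == "__main__":
    short = loss_weight(t=0.6, r=0.5)  # short interval
    long_ = loss_weight(t=1.2, r=0.1)  # long interval
    print(short > long_)  # True: short intervals get larger weights
```

Since tan(t) − tan(r) grows with the interval length, the weight shrinks for long jumps, which is consistent with the stated goal of prioritizing short intervals to reduce gradient variance.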
Q1
1. What is the key innovation in TiM that allows it to overcome the speed-quality trade-off of existing models?
Using larger neural networks with more parameters
Learning arbitrary-interval state transitions along the generative trajectory
Implementing a new type of attention mechanism
Q2
2. How does TiM's parameter efficiency compare to other state-of-the-art models?
TiM requires more parameters but runs faster
TiM and other models use similar parameter counts
TiM achieves better results with significantly fewer parameters (865M vs 8-12B)
Q3
3. What is the most significant practical improvement introduced by TiM's Differential Derivation Equation (DDE)?
It improves the quality of generated images
It makes the model training compatible with distributed frameworks like FSDP
It reduces the required training time