2025-09-08 Papers


Paper 1

Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Published: 2025-09-04

Link: http://arxiv.org/pdf/2509.04292

1. 📘 Topic and Domain: The paper presents Inverse IFEval, a benchmark for evaluating large language models' ability to follow counter-intuitive instructions that conflict with their training conventions.
2. 💡 Previous Research and New Ideas: The paper builds on existing evaluation benchmarks such as MMLU and IFEval, but introduces a novel focus on testing models' ability to override training-induced biases and follow adversarial instructions.
3. ❓ Problem: The paper addresses LLMs' cognitive inertia: their tendency to stubbornly follow standardized patterns learned during training rather than adapting to unconventional or counter-intuitive instructions.
4. 🛠️ Methods: The authors constructed a dataset of 1012 Chinese and English questions across 23 domains using a multi-stage human-in-the-loop pipeline, incorporating eight types of counter-intuitive instructions and evaluating models using an optimized LLM-as-a-Judge framework.
5. 📊 Results and Evaluation: The results showed that even advanced LLMs struggle with counter-intuitive instructions, with o3-high performing best but still showing limitations, highlighting the need for improved model adaptability to unconventional contexts.
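The optimized LLM-as-a-Judge framework in step 4 can be sketched as a simple prompt-and-verdict loop. This is a minimal illustration, not the paper's actual judge prompts or models: `judge_fn`, `JUDGE_TEMPLATE`, and the rubric text are all hypothetical placeholders.

```python
# Minimal sketch of an LLM-as-a-Judge check for counter-intuitive
# instructions. `judge_fn` stands in for a call to a dedicated judge
# model; the template and rubric are illustrative, not the paper's.

JUDGE_TEMPLATE = """You are grading whether a model FOLLOWED a
counter-intuitive instruction, even if the result looks "wrong"
by normal conventions.

Instruction: {instruction}
Model answer: {answer}
Rubric: {rubric}

Reply with exactly PASS or FAIL."""


def judge_answer(instruction, answer, rubric, judge_fn):
    """Format the judge prompt and parse a binary verdict."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction, answer=answer, rubric=rubric
    )
    verdict = judge_fn(prompt).strip().upper()
    return verdict == "PASS"


def score_model(examples, judge_fn):
    """Fraction of counter-intuitive instructions the model followed."""
    passed = sum(
        judge_answer(ex["instruction"], ex["answer"], ex["rubric"], judge_fn)
        for ex in examples
    )
    return passed / len(examples)


if __name__ == "__main__":
    # Stub judge for a hypothetical "Code without Comments" rubric:
    # fail any answer that contains a Python comment character.
    def stub_judge(prompt):
        answer = prompt.split("Model answer:")[1].split("Rubric:")[0]
        return "FAIL" if "#" in answer else "PASS"

    examples = [
        {"instruction": "Write Python code WITHOUT any comments.",
         "answer": "def add(a, b):\n    return a + b",
         "rubric": "PASS only if the code contains zero comments."},
        {"instruction": "Write Python code WITHOUT any comments.",
         "answer": "# adds two numbers\ndef add(a, b):\n    return a + b",
         "rubric": "PASS only if the code contains zero comments."},
    ]
    print(score_model(examples, stub_judge))  # 0.5
```

A real judge would replace `stub_judge` with an API call to the dedicated judge model; the binary PASS/FAIL parse keeps the metric simple to aggregate.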

Methodology flow chart (key stages):
- Problem identification: cognitive inertia in LLMs and training-convention bias, surfaced by analyzing the SFT paradigm and reversing observed training patterns.
- Benchmark design: eight adversarial instruction types targeting counter-cognitive ability.
- Data construction pipeline: seed data manually crafted by experts with cross-validation and inter-rater agreement; large-scale LLM-based generation with prompt engineering across 23 domains; automatic filtering on length constraints and semantic similarity; human quality assurance (type consistency, clarity checks, rubric calibration); final dataset of 1012 Chinese and English questions with metadata annotation.
- Evaluation framework: LLM-as-a-Judge with a dedicated judge model (98% accuracy), optimized via template structure and system-prompt enhancement.
- Experimental analysis: impact of the thinking mechanism, comparison across instruction types, ranking variations versus IFEval, and test-time scaling (Best-of-N).
- Key findings: LLMs show cognitive inertia on counter-intuitive instructions; thinking mechanisms improve performance; fine-tuned models struggle more; performance varies across instruction types.
Q1
1. What is the main challenge or limitation of LLMs that Inverse IFEval aims to evaluate?
Models' inability to understand complex instructions
Models' tendency to stick to training conventions rather than following new instructions
Models' poor performance in multiple languages
Q2
2. Which of the following is NOT one of the eight types of counter-intuitive instructions tested in Inverse IFEval?
Question Correction and Code without Comments
Deliberately Incorrect Answers and Counterfactual Answering
Emotional Response Generation and Sentiment Analysis
Q3
3. What unique aspect of the evaluation methodology makes Inverse IFEval different from traditional benchmarks?
It only tests models in English language
It focuses on models' speed and efficiency
It deliberately creates scenarios that conflict with training patterns

Paper 2

A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Published: 2025-08-25

Link: http://arxiv.org/pdf/2508.18106

1. 📘 Topic and Domain: A benchmark for evaluating security in AI-generated code at the repository level, focusing on software engineering and code security.
2. 💡 Previous Research and New Ideas: Based on existing code security benchmarks that focus on isolated snippets, this paper proposes a new benchmark that evaluates security at the repository level while maintaining full project context and dependencies.
3. ❓ Problem: Current benchmarks lack repository-level context, have unstable evaluation methods, and fail to connect input context quality with output security, making it difficult to properly assess AI-generated code security.
4. 🛠️ Methods: Created A.S.E benchmark using real-world repositories with documented CVEs, implemented a containerized evaluation framework with expert-defined rules, and evaluated models across security, build quality, and generation stability metrics.
5. 📊 Results and Evaluation: Claude-3.7-Sonnet achieved best overall performance (52.79 points), open-source model Qwen3-235B-A22B-Instruct achieved best security score, and "fast-thinking" configurations consistently outperformed "slow-thinking" strategies for security patching.

Benchmark construction and evaluation workflow (key stages):
- Data sourcing: 100k+ CVE entries linked to GitHub repositories; algorithm-guided screening by CWE and language; expert-guided curation with manual review and SAST; dataset expansion via semantic and structural mutation (funnel: 5k+ → 199 → 40 → 120 repository instances).
- Vulnerability categories (CWE): SQL injection CWE-89 (29.2%), path traversal CWE-22 (26.7%), XSS CWE-79 (25.0%), command injection CWE-78 (19.2%).
- Programming languages: PHP 50.0%, Python 19.2%, Go 14.2%, JavaScript 14.2%, Java 2.5%.
- Code generation phase: repository context retrieved with BM25, vulnerable code masked, then regenerated by the LLM.
- Security evaluation framework: Overall Score = 0.6 × Security + 0.3 × Quality + 0.1 × Stability, scored with CodeQL and Joern SAST under expert-defined rules inside Docker containers.
- Key technologies: Docker, CodeQL, Joern, BM25, git apply, SAST, expert rules, mutation.
- Key findings: Claude-3.7-Sonnet best overall (63.01; security 46.72); Qwen3-235B-A22B best security (48.03), leading among open-source models; "fast-thinking" outperforms "slow-thinking" on security tasks. 26 SOTA LLMs evaluated over 120 repository instances, 3 trials per model.
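The weighted scoring above is easy to reproduce. Only the 0.6/0.3/0.1 weights come from the paper; the component scores passed in below are made-up inputs for illustration.

```python
# A.S.E overall score: Overall = 0.6 * Security + 0.3 * Quality
# + 0.1 * Stability. The weights are from the benchmark; the example
# inputs are illustrative, not any model's actual sub-scores.

WEIGHTS = {"security": 0.6, "quality": 0.3, "stability": 0.1}


def overall_score(security: float, quality: float, stability: float) -> float:
    """Weighted A.S.E overall score; all inputs on a 0-100 scale."""
    return (WEIGHTS["security"] * security
            + WEIGHTS["quality"] * quality
            + WEIGHTS["stability"] * stability)


if __name__ == "__main__":
    print(overall_score(security=50.0, quality=70.0, stability=90.0))  # 60.0
```

The 60% weight on security explains why a model can lead the overall ranking without having the single best security score, as happens with Claude-3.7-Sonnet versus Qwen3-235B-A22B.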
Q1
1. What surprising finding did the study reveal about model reasoning strategies?
Complex 'slow-thinking' strategies performed best for security patches
Fast and slow thinking strategies performed equally well
Simple 'fast-thinking' strategies outperformed complex reasoning approaches
Q2
2. Which vulnerability type presented the greatest challenge for LLMs according to the benchmark results?
SQL Injection
Path Traversal
Cross-Site Scripting (XSS)
Q3
3. What key innovation differentiates A.S.E from previous code security benchmarks?
It focuses only on snippet-level code evaluation
It uses LLMs as the primary security evaluators
It preserves full repository context and cross-file dependencies

Paper 3

Transition Models: Rethinking the Generative Learning Objective

Published: 2025-09-04

Link: http://arxiv.org/pdf/2509.04394

1. 📘 Topic and Domain: Generative modeling with a focus on improving diffusion models for efficient and high-quality image generation.
2. 💡 Previous Research and New Ideas: Based on diffusion models, consistency models, and flow-matching approaches; proposes a novel Transition Models (TiM) framework that learns state transitions across arbitrary time intervals.
3. ❓ Problem: Addresses the trade-off between generation quality and computational efficiency in existing generative models, where models either require many steps for high quality or sacrifice quality for speed.
4. 🛠️ Methods: Introduces a continuous-time dynamics equation for state transitions, implements decoupled time embeddings and interval-aware attention, and uses an efficient Differential Derivation Equation (DDE) for training.
5. 📊 Results and Evaluation: TiM (865M parameters) outperforms larger models like SD3.5 (8B) and FLUX.1 (12B) across different metrics, achieving superior performance with both single-step and multi-step generation while scaling effectively to 4096×4096 resolution.
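The DDE idea in step 4 replaces an exact derivative with a finite difference of two forward passes. Here is a toy numerical sketch under stated assumptions: `f` is a simple analytic stand-in for the network f_θ, not the paper's training code.

```python
import math

# Toy illustration of the DDE idea: approximate the total time
# derivative df/dt along a trajectory with a finite difference of two
# forward passes, instead of an exact Jacobian-vector product (JVP).
# `f` is a placeholder for the network f_theta.


def f(x: float, t: float) -> float:
    # Placeholder "network": any smooth function of state and time.
    return math.sin(t) * x


def df_dt_finite_difference(x, t, dx_dt, eps=1e-4):
    """Finite-difference total derivative: two evaluations of f replace
    an exact JVP (the paper reports roughly 2x lower cost and FSDP
    compatibility for this style of derivative)."""
    return (f(x + eps * dx_dt, t + eps) - f(x, t)) / eps


def df_dt_exact(x, t, dx_dt):
    # Exact total derivative for the toy f, for comparison:
    # d/dt [sin(t) * x(t)] = cos(t) * x + sin(t) * dx/dt
    return math.cos(t) * x + math.sin(t) * dx_dt


if __name__ == "__main__":
    x, t, dx_dt = 0.7, 0.3, -1.2
    approx = df_dt_finite_difference(x, t, dx_dt)
    exact = df_dt_exact(x, t, dx_dt)
    print(abs(approx - exact) < 1e-3)  # True
```

The appeal is operational: a finite difference needs only ordinary forward passes, which is what makes it cheap and friendly to sharded training setups such as FSDP.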

Methodology flow (key components):
- Problem analysis: diffusion models give high quality but need many steps; few-step models are fast but hit a quality ceiling, so performance saturates.
- Core innovation: a state-transition identity, d/dt(B_{t,r} · h(t)) = 0, enabling transitions over arbitrary intervals Δt and learning the solution manifold instead of local dynamics.
- Mathematical framework: state transition x_r = A_{t,r}·x_t + B_{t,r}·f_θ(x_t, t, r); training target f̂ = α̂_t·x + σ̂_t·ε plus correction terms.
- Scalability (DDE): df/dt approximated by a finite difference, about 2× faster than a JVP and compatible with FSDP.
- Stability: loss weighting w(t,r) = (σ_data + tan(t) − tan(r))^(−1/2), which prioritizes short intervals and reduces gradient variance.
- Architecture: decoupled time embedding E_{t,Δt} = φ_t(t) + φ_Δt(Δt), plus interval-aware attention that adds E_Δt projections to Q, K, and V.
- Training: from scratch, 865M parameters, native resolution, 30 days on 16 A100s.
- Sampling: arbitrary-step sampling from 1 to 128 NFEs, with monotonic quality improvement and insensitivity to the schedule.
- Results: outperforms SD3.5 and FLUX.1; GenEval 0.67 → 0.83; scales to 4096×4096 resolution.
- Final training objective: E[w(t,r) · d(f_θ(x_t, t, r), f̂)], unifying few-step efficiency with many-step quality in a single model.
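The interval-dependent loss weight w(t,r) from the flow above can be evaluated directly to see that it up-weights short intervals. The σ_data value below is an illustrative guess, not taken from the paper.

```python
import math

# Loss weight from the TiM methodology:
# w(t, r) = (sigma_data + tan(t) - tan(r)) ** (-1/2),
# which assigns larger weights to short transition intervals
# (t close to r). sigma_data = 0.5 is an illustrative value only.

SIGMA_DATA = 0.5


def loss_weight(t: float, r: float) -> float:
    """Interval-aware loss weight; assumes r <= t so tan(t) >= tan(r)
    and the base of the power stays positive."""
    return (SIGMA_DATA + math.tan(t) - math.tan(r)) ** -0.5


if __name__ == "__main__":
    short = loss_weight(t=0.6, r=0.5)  # short interval
    long_ = loss_weight(t=1.2, r=0.1)  # long interval
    print(short > long_)  # True: short intervals get larger weights
```

Since tan(t) − tan(r) grows with the interval length, the weight shrinks for long jumps, which is consistent with the stated goal of prioritizing short intervals to reduce gradient variance.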
Q1
1. What is the key innovation in TiM that allows it to overcome the speed-quality trade-off of existing models?
Using larger neural networks with more parameters
Learning arbitrary-interval state transitions along the generative trajectory
Implementing a new type of attention mechanism
Q2
2. How does TiM's parameter efficiency compare to other state-of-the-art models?
TiM requires more parameters but runs faster
TiM and other models use similar parameter counts
TiM achieves better results with significantly fewer parameters (865M vs 8-12B)
Q3
3. What is the most significant practical improvement introduced by TiM's Differential Derivation Equation (DDE)?
It improves the quality of generated images
It makes the model training compatible with distributed frameworks like FSDP
It reduces the required training time