1. 📘 Topic and Domain: A benchmark for evaluating the security of AI-generated code at the repository level, situated in software engineering and code security.
2. 💡 Previous Research and New Ideas: Building on existing code-security benchmarks that evaluate only isolated snippets, this paper proposes a benchmark that assesses security at the repository level while preserving full project context and dependencies.
3. ❓ Problem: Current benchmarks lack repository-level context, have unstable evaluation methods, and fail to connect input context quality with output security, making it difficult to properly assess AI-generated code security.
4. 🛠️ Methods: Constructed the A.S.E benchmark from real-world repositories with documented CVEs, implemented a containerized evaluation framework with expert-defined rules, and evaluated models on security, build quality, and generation-stability metrics.
5. 📊 Results and Evaluation: Claude-3.7-Sonnet achieved the best overall performance (52.79 points), the open-source model Qwen3-235B-A22B-Instruct achieved the best security score, and "fast-thinking" configurations consistently outperformed "slow-thinking" strategies for security patching.
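The multi-dimensional scoring described above (security, build quality, generation stability combined into one overall number) can be sketched as a weighted aggregate. This is a minimal illustration only: the weights, the 0-100 scale, and the function name are assumptions for clarity, not the paper's actual formula.

```python
# Hypothetical sketch of aggregating per-dimension benchmark scores into an
# overall score, as a benchmark like A.S.E might do. The three dimensions
# come from the summary above; the weights and 0-100 scale are illustrative
# assumptions, not the paper's real scoring rule.

def overall_score(security: float, quality: float, stability: float,
                  weights: tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Weighted mean of three evaluation dimensions, each on a 0-100 scale."""
    w_sec, w_qual, w_stab = weights
    return round(w_sec * security + w_qual * quality + w_stab * stability, 2)

# Example: a model strong on build quality but weaker on security.
print(overall_score(security=45.0, quality=70.0, stability=65.0))  # → 54.0
```

Weighting security most heavily (here 0.6) reflects the benchmark's emphasis on security as the primary evaluation axis, but any such weighting is a design choice of the evaluator.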