Evaluating AI Agent Resilience Against Smart Contract Exploits: An Open-Source Benchmark

A New Benchmarking Tool for Smart Contract Security

A new benchmarking tool, EVMbench, has been released to assess the performance of artificial intelligence (AI) agents in detecting, patching, and exploiting vulnerabilities in smart contract code.

Developed by OpenAI and Paradigm

OpenAI and Paradigm built EVMbench as an open-source platform that evaluates AI models on practical smart contract security tasks, providing a repeatable and transparent way to measure their effectiveness.

Three Primary Tasks

The benchmark focuses on three primary tasks: detecting vulnerabilities, patching vulnerable code, and exploiting flaws in a controlled environment.

Detecting vulnerabilities: The AI model reviews smart contract repositories to identify known vulnerabilities documented by professional auditors.
Patching vulnerable code: The AI model modifies contract code to remove vulnerabilities without breaking expected functionality.
Exploiting flaws: The AI model is given a sandboxed blockchain environment and asked to execute an exploit against a vulnerable contract.
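
To make the detection task concrete, the sketch below scores an agent's report against the findings documented by auditors. It is a minimal illustration, not EVMbench's actual grader: the `Finding` record, its fields, and the exact-match rule in `detect_recall` are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical vulnerability record: where the issue is and what kind it is."""
    contract: str
    function: str
    category: str


def detect_recall(reported: set[Finding], documented: set[Finding]) -> float:
    """Fraction of auditor-documented findings that the agent also reported."""
    if not documented:
        return 0.0
    return len(reported & documented) / len(documented)


# Illustrative data only; these findings are invented for the example.
documented = {
    Finding("Vault", "withdraw", "reentrancy"),
    Finding("Vault", "setOwner", "access-control"),
}
reported = {
    Finding("Vault", "withdraw", "reentrancy"),
    Finding("Token", "mint", "overflow"),  # not in the audit, so it earns no credit
}
score = detect_recall(reported, documented)  # one of two documented findings matched
```

Exact matching is the simplest possible rule. A real grader would likely need a looser comparison, since an agent can describe the same bug in different words than the auditor did.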

EVMbench Dataset

EVMbench uses a dataset of 120 curated vulnerabilities across 40 audits, with most cases drawn from open audit competitions and additional scenarios sourced from Paradigm’s Tempo audit process.

The benchmark is designed to reflect realistic development conditions, which makes its cases more complex than those in synthetic vulnerability datasets.

Exploit Grading Process

The exploit grading process is handled through deterministic replay in a controlled test environment, where the AI model interacts with a local EVM instance to deploy contracts, call functions, and attempt to execute fund-draining transactions.
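
The idea of grading by deterministic replay can be illustrated with a toy model. The sketch below does not use a real EVM: `ToyVault` is a hypothetical in-memory stand-in for a vulnerable contract, and `grade_exploit` is an assumed grading rule that replays the agent's transactions on a fresh instance and checks whether the funds were drained.

```python
from __future__ import annotations


class ToyVault:
    """Stand-in for a vulnerable contract: withdraw() ignores who deposited what."""

    def __init__(self, deposits: dict[str, int]):
        self.deposits = dict(deposits)
        self.funds = sum(deposits.values())

    def withdraw(self, caller: str, amount: int) -> int:
        # Bug: no check that `amount` is covered by the caller's own deposit.
        if amount > self.funds:
            raise ValueError("insufficient vault funds")
        self.funds -= amount
        return amount


def grade_exploit(make_vault, transactions, attacker: str) -> bool:
    """Replay the transactions on a fresh vault; pass only if the vault is drained."""
    vault = make_vault()
    attacker_gain = 0
    for function_name, args in transactions:
        attacker_gain += getattr(vault, function_name)(attacker, *args)
    return vault.funds == 0 and attacker_gain > 0


def fresh_vault() -> ToyVault:
    # Every grading run starts from the same state, so replays are repeatable.
    return ToyVault({"alice": 60, "bob": 40})


full_drain = [("withdraw", (100,))]
partial = [("withdraw", (10,))]
drained = grade_exploit(fresh_vault, full_drain, attacker="mallory")
not_drained = grade_exploit(fresh_vault, partial, attacker="mallory")
```

Because each run starts from an identical fresh state and the transaction list is fixed, replaying the same exploit always yields the same verdict, which is what makes this style of grading repeatable.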

Initial Benchmark Results

Initial benchmark results show uneven performance across detect, patch, and exploit tasks. While some models can identify vulnerabilities at a surface level, exploit tasks remain challenging.

However, recent model generations have shown significant improvement, with some models able to exploit over 70% of critical vulnerabilities.

Patching remains a major weakness, requiring AI models to preserve correct behavior across edge cases and understand deeper design assumptions in the code.
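
A toy example shows why patching is harder than it looks: a fix has to block the exploit while still letting honest users do what they could before. The functions below are hypothetical and written in Python rather than Solidity; the two-part check in `grade_patch` is an assumed criterion for illustration, not EVMbench's documented grading logic.

```python
def vulnerable_withdraw(balances, caller, amount):
    """Original logic: pays out without checking the caller's balance."""
    balances[caller] = balances.get(caller, 0) - amount
    return amount


def patched_withdraw(balances, caller, amount):
    """Candidate patch: reject withdrawals above the caller's balance."""
    if amount <= 0 or balances.get(caller, 0) < amount:
        raise ValueError("invalid withdrawal")
    balances[caller] -= amount
    return amount


def reject_all_withdraw(balances, caller, amount):
    """Over-eager patch: blocks the exploit by blocking everyone."""
    raise ValueError("withdrawals disabled")


def functionality_preserved(withdraw) -> bool:
    """An honest user must still be able to withdraw their own deposit."""
    balances = {"alice": 50}
    try:
        paid = withdraw(balances, "alice", 50)
    except ValueError:
        return False
    return paid == 50 and balances["alice"] == 0


def exploit_blocked(withdraw) -> bool:
    """An attacker with no deposit must not be able to withdraw."""
    balances = {"alice": 50}
    try:
        withdraw(balances, "mallory", 50)
    except ValueError:
        return True
    return False


def grade_patch(withdraw) -> bool:
    """Accept a patch only if it keeps behavior intact and stops the exploit."""
    return functionality_preserved(withdraw) and exploit_blocked(withdraw)
```

Under this rule the unpatched function fails because the exploit still works, and the version that rejects every withdrawal fails because it breaks legitimate use. Only the targeted fix passes both checks.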

Availability and Impact

EVMbench is available for free on GitHub, allowing researchers and security teams to test models consistently as AI agent capabilities evolve.

The release of EVMbench highlights the growing importance of smart contract security: billions of dollars in open-source crypto assets are secured by smart contracts, and any flaw in those contracts can expose the funds to exploits.

By providing a standardized benchmarking platform, EVMbench aims to drive innovation and improvement in AI-powered smart contract security tools.