Scoring AI Hackers Without an Answer Key: Evaluation Methods

AI models are becoming increasingly proficient at the offensive cybersecurity tests designed to measure their capabilities.

The Problem with Existing Benchmarks

As these systems solve a growing share of each benchmark, the scores lose their ability to separate top performers. Many existing assessments rely on vulnerabilities with publicly documented exploits, allowing models to achieve high scores by replicating known techniques.

FrontierCyber: A New Approach

FrontierCyber, a benchmark developed by the AI security lab Irregular, addresses this limitation by testing models against real-world systems without pre-planted flaws or guidance. The benchmark simulates attacks on tangible targets such as mobile devices, hosted applications, databases, and live networks. These systems retain their standard security measures, including sandboxing, authentication protocols, and network segmentation. Models are provided with a specific objective and an initial access point, with no additional hints about potential weaknesses. The lab spent six months designing the framework, releasing version 1.0 this week.

Predicting Challenge Complexity

Predicting how hard a challenge is before any model attempts it is a core problem. Traditional benchmarks ship with predefined difficulty levels, but open-ended tests have no such labels, so complexity must be estimated beforehand. FrontierCyber employs a two-step approach. First, before a model interacts with a system, each challenge receives a difficulty rating (Easy, Medium, Hard, or Elite) based on factors such as the language the code is written in, visibility of components, historical vulnerability data, and defensive barriers. Systems are rated by analogy to comparable software, such as browsers for web interfaces or apps for application-level code. Second, the rating is adjusted for attack surface, objective, and device configuration.
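The two-step rating described above can be sketched roughly as follows. This is only an illustration, not Irregular's actual rubric: the Challenge fields, the ANALOG_BASELINE table, and every weight are hypothetical values chosen to show the shape of a baseline-then-adjust calculation.

```python
from dataclasses import dataclass

TIERS = ["Easy", "Medium", "Hard", "Elite"]

# Hypothetical baseline tier (index into TIERS) for each software analog.
ANALOG_BASELINE = {"library": 0, "app": 1, "database": 1, "browser": 2}


@dataclass
class Challenge:
    analog: str                 # closest comparable software class
    memory_safe_language: bool  # e.g. written in a memory-safe language
    source_visible: bool        # can the model read the target's code?
    prior_cves: int             # historical vulnerability count
    defensive_layers: int       # sandboxing, auth, segmentation, ...


def rate_difficulty(c: Challenge) -> str:
    # Step 1: start from the baseline of the closest software analog.
    level = float(ANALOG_BASELINE[c.analog])
    # Step 2: adjust for target-specific factors (weights are assumptions).
    if c.memory_safe_language:
        level += 0.5
    if not c.source_visible:
        level += 0.5
    if c.prior_cves < 5:
        level += 0.5  # a sparse vulnerability history suggests a harder target
    level += 0.5 * c.defensive_layers
    # Clamp to the four published tiers.
    index = min(len(TIERS) - 1, max(0, int(level)))
    return TIERS[index]
```

For example, under these assumed weights a well-studied open-source library with no defensive layers stays at Easy, while a closed-source browser behind three defensive layers is clamped to Elite.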

Tracking Partial Progress

Tracking partial progress is critical, as many attempts terminate before achieving the goal. Irregular embeds subtle indicators within systems, such as unique identifiers, anomalous files, or database markers, to detect unauthorized access. For instance, a mobile browser should not be able to see which applications are installed, so the lab installs an app with a distinctive name. If a model references this name during its analysis, that suggests it has bypassed a security restriction, even if the full objective remains unmet.
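A minimal sketch of this kind of marker check, assuming the model's output is available as a plain-text transcript. The function names, the marker format, and the boundary labels are hypothetical and not taken from FrontierCyber.

```python
import secrets


def make_canary(prefix: str = "fc") -> str:
    """Generate a unique marker that should never occur by chance."""
    return f"{prefix}-{secrets.token_hex(8)}"


def find_tripped_canaries(transcript: str, canaries: dict) -> list:
    """Return the boundaries whose marker shows up in the transcript.

    `canaries` maps a boundary label (e.g. "installed_app_list") to the
    marker string planted behind that boundary.
    """
    lowered = transcript.lower()
    return [
        boundary
        for boundary, marker in canaries.items()
        if marker.lower() in lowered
    ]
```

A usage example: planting `{"installed_app_list": make_canary()}` as an app name and later finding that marker in the transcript would count as evidence that the app-list boundary was crossed. A plain substring match like this misses encoded or paraphrased references, so it is best read as a lower bound on progress.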

Evaluating Outcomes

Evaluating outcomes involves analyzing both the model’s actions and the system’s responses. A complete success is verified by recovering a hidden flag or achieving a defined system state. Partial achievements, such as identifying exploitable weaknesses or constructing components of an attack, receive proportional credit. Automated tools handle routine checks, while human experts assess nuanced findings. A scoring agent then compares results against benchmarks derived from expert evaluations.
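Proportional credit of this kind could be computed along the following lines. The milestone names and point values are assumptions made for illustration; the article does not publish the actual weights or the rules the scoring agent applies.

```python
# Hypothetical milestones and point values (out of 100); partial credit
# is deliberately capped below the score for a full solve.
MILESTONE_POINTS = {
    "weakness_identified": 20,
    "exploit_component_built": 30,
    "boundary_crossed": 30,
}
FULL_SOLVE_POINTS = 100


def score_attempt(flag_recovered: bool, milestones: set) -> int:
    """Score one attempt: full marks for a verified solve, else partial."""
    if flag_recovered:
        return FULL_SOLVE_POINTS
    # Unknown milestone names are ignored rather than scored.
    return sum(
        points
        for name, points in MILESTONE_POINTS.items()
        if name in milestones
    )
```

In this sketch the automated part is the flag check and the milestone lookup; deciding whether a finding really counts as "weakness_identified" is the step the article assigns to human experts.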

Assessing Capability Across Challenges

Assessing capability across multiple challenges makes the measurement more reliable. Individual test outcomes can vary due to random factors, so results are aggregated across the entire dataset. Real-world systems also evolve over time: updates, configuration changes, and newly disclosed vulnerabilities all affect test validity. To maintain consistency, each evaluation is tied to a specific system snapshot, including exact configurations, timestamps, and scoring rules. As a result, scores are directly comparable only within the same snapshot.
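One way to enforce snapshot-scoped comparison is to carry a snapshot identifier on every result and aggregate only within it. The Result record and the use of a simple mean are assumptions for this sketch; the article does not say which aggregate FrontierCyber reports.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Result:
    model: str
    snapshot_id: str  # pins configurations, timestamps, and scoring rules
    challenge: str
    score: float


def aggregate(results: list, snapshot_id: str) -> dict:
    """Mean score per model, using only results from one snapshot."""
    per_model = {}
    for r in results:
        if r.snapshot_id != snapshot_id:
            continue  # never mix scores across snapshots
        per_model.setdefault(r.model, []).append(r.score)
    return {model: mean(scores) for model, scores in per_model.items()}
```

Filtering before averaging means a model evaluated against a later, patched snapshot simply does not appear in the earlier snapshot's table, rather than being silently compared on unequal terms.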

Initial Testing Results

Initial testing against a static snapshot yielded meaningful results. Models solved certain challenges outright, advanced significantly in others, and uncovered previously unknown vulnerabilities in live systems; those findings are now going through responsible disclosure. In one mobile device test, a model chained several separate flaws together to access restricted data. Across different model generations, newer versions showed measurable improvements, with some completing full exploit chains and others identifying actionable vulnerabilities.

Test Environment and Future Reports

The test environment includes software like Pillow, lxml, FFmpeg, ImageMagick, PostgreSQL, MongoDB, and Redis, alongside deliberately vulnerable versions to assess exploit development skills. A comprehensive report detailing challenges, scoring methodologies, and findings is forthcoming.