Dataset-Specific Overfitting in Machine Learning Malware Detectors
Static Detectors Struggle to Catch Obfuscated Malware Outside Their Training Dataset
The effectiveness of machine learning (ML)-based static detectors at catching malware outside their training dataset has been called into question by recent research. These detectors, commonly deployed as a first line of defense against malware, are typically evaluated on data that closely resembles their training set.
However, the study found that when faced with malware drawn from a different distribution than the one they were trained on, these detectors can struggle to detect it.
The researchers tested the ability of static detectors to identify malware that had been obfuscated to evade detection. They created detection pipelines using a standardized feature format common across six public Windows PE datasets. The models were then evaluated not only on held-out data from their own training distribution but also on four external datasets:
- TRITIUM, built from naturally occurring threat samples collected from operational environments;
- INFERNO, derived from red team and custom command-and-control malware;
- SOREL-20M, a large-scale benchmark covering several years of real-world PE files;
- ERMDS, a dataset constructed specifically to challenge detectors with obfuscated samples at the binary, source code, and packer levels.
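The core of this protocol is simple: train a model on one corpus, freeze it, and score every other corpus in the same feature space. The sketch below illustrates that train-on-one/test-on-others loop with a toy nearest-centroid classifier and made-up feature vectors; the actual datasets, feature extraction, and models from the study are assumptions not shown here.

```python
# Sketch of a cross-dataset evaluation protocol: train on one corpus,
# then score every other corpus with the same frozen model. The tiny
# nearest-centroid "model" is a stand-in for the study's classifiers;
# the dataset contents below are made-up toy feature vectors.

def train_centroids(samples):
    """samples: list of (feature_vector, label), label 0=benign, 1=malware."""
    sums = {0: None, 1: None}
    counts = {0: 0, 1: 0}
    for vec, label in samples:
        if sums[label] is None:
            sums[label] = [0.0] * len(vec)
        for i, v in enumerate(vec):
            sums[label][i] += v
        counts[label] += 1
    return {lbl: [s / counts[lbl] for s in sums[lbl]] for lbl in (0, 1)}

def predict(centroids, vec):
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return 1 if dist(centroids[1]) < dist(centroids[0]) else 0

def accuracy(centroids, samples):
    return sum(predict(centroids, v) == y for v, y in samples) / len(samples)

# Toy corpora in a shared feature space (the standardized feature format
# is what makes cross-dataset scoring possible at all).
train_set = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]
external = {
    "in-distribution holdout": [([0.15, 0.15], 0), ([0.85, 0.85], 1)],
    "shifted external set": [([0.4, 0.45], 0), ([0.5, 0.6], 0),
                             ([0.6, 0.55], 1), ([0.45, 0.5], 1)],
}

model = train_centroids(train_set)
for name, samples in external.items():
    print(f"{name}: accuracy={accuracy(model, samples):.2f}")
```

On this toy data the frozen model scores perfectly on the in-distribution holdout but only 0.50 on the shifted set, mirroring (in miniature) the gap the study measured across real corpora.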
On data from their own training distribution, the best-performing models achieved AUC and F1 scores in the high 90s, indicating strong performance. However, the external datasets told a more sobering story. The models performed poorly on INFERNO, the red team and C2 dataset, and on SOREL-20M, the largest and most temporally diverse external dataset, with some model configurations degrading to the point that their practical utility at low false positive rates would be limited.
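The distinction between a high aggregate AUC and usefulness at a strict false positive budget is worth making concrete. The pure-Python helpers below (a sketch with made-up scores, not the study's numbers) compute ROC AUC via the rank-sum formulation alongside the TPR achievable within a fixed FPR budget, which is the metric that actually gates production deployment.

```python
# Why a "strong AUC" can still fail in production: deployment budgets
# are expressed as TPR at a fixed low false positive rate, not as
# aggregate AUC. Toy scores below are made-up for illustration.

def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation; ties get half credit."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(labels, scores, max_fpr):
    """Highest TPR at any threshold whose FPR stays within max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    allowed = int(max_fpr * len(neg))  # max false positives within budget
    t = neg[allowed] if allowed < len(neg) else float("-inf")
    return sum(s > t for s in pos) / len(pos)

# 10 benign (label 0) then 10 malicious (label 1) detection scores.
labels = [0] * 10 + [1] * 10
scores = [0.9, 0.6, 0.5, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1] \
       + [0.95, 0.85, 0.8, 0.75, 0.7, 0.65, 0.55, 0.45, 0.3, 0.2]

print(f"AUC = {roc_auc(labels, scores):.2f}")
print(f"TPR at 10% FPR budget = {tpr_at_fpr(labels, scores, 0.1):.2f}")
```

With only 10 negatives the smallest nonzero budget is one false positive; on real corpora the same calculation is done at budgets like 0.1% FPR, where a model with a respectable AUC can still catch very little malware.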
One of the more instructive findings involved the attempt to fix the obfuscation problem directly. Adding ERMDS to the training set improved performance on obfuscated samples within that dataset’s distribution. However, it also reduced generalization to SOREL-20M relative to training without it. This pattern suggests a tension that practitioners building or procuring static detectors should be aware of: training a model to recognize obfuscated malware can shift its feature distribution in ways that reduce its effectiveness on broader, more diverse data.
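The mechanism behind this tension can be shown with a deliberately tiny model. In the sketch below (toy numbers, not the study's data), a classifier thresholds a single hypothetical "suspiciousness" feature at the midpoint of the class means. Adding obfuscated malware, which scores low on that feature, pulls the learned threshold down and flips the decision on a legitimate but unusual external file.

```python
# Toy illustration of the training-composition tension: adding
# benign-looking obfuscated malware to training shifts the learned
# decision boundary, causing false positives on broader external data.
# The single "suspiciousness" feature and all values are made up.

def fit_threshold(benign, malware):
    """Midpoint between class means; flag anything above the threshold."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(benign) + mean(malware)) / 2

benign_train = [0.10, 0.15, 0.20, 0.15]    # ordinary benign files
malware_train = [0.80, 0.85, 0.90, 0.85]   # ordinary malware
obfuscated = [0.30, 0.25, 0.35, 0.30]      # obfuscated malware looks benign-ish

t_base = fit_threshold(benign_train, malware_train)
t_aug = fit_threshold(benign_train, malware_train + obfuscated)

external_benign = 0.45  # a legitimate but unusual file from a broader corpus
print(f"base threshold={t_base:.4f} -> flagged: {external_benign > t_base}")
print(f"augmented threshold={t_aug:.4f} -> flagged: {external_benign > t_aug}")
```

The base model leaves the external benign file alone (threshold 0.5), while the augmented model flags it (threshold ~0.36): the augmentation bought in-distribution coverage of obfuscated samples at the cost of false positives elsewhere, the same trade-off observed between ERMDS and SOREL-20M.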
This research highlights the importance of evaluating a detector's benchmark performance against the threat landscape it will actually encounter. Red team tooling, packed malware, and temporally shifted samples can all degrade a model that looks strong on paper. The researchers plan to extend the evaluation to deep learning architectures, with continued focus on how training data composition affects detection at the low false positive rates required for production deployments.
