LLM Security Advice: Uncovering Hidden Vulnerabilities in Real-World Scenarios


Users increasingly seek solutions for security issues through chatbots, ranging from compromised accounts to potential stalking situations.

HelpBench: Evaluating AI Security Guidance

Researchers from University College London and Google created HelpBench using 450 questions derived from online discussions about digital privacy, safety, and security. Each query was rephrased using an AI model and manual review to mimic natural user input while removing identifying details. The dataset was then tested against 18 major language models, with responses evaluated using expert-designed rubrics focusing on factual accuracy, clarity, and appropriateness.

Overall Performance Metrics

The models averaged 82% overall, with higher marks on questions about scams and account compromises. That average, however, masked significant variability in quality: approximately 10% of responses scored below 65%, and some contained hazardous recommendations.
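To illustrate how a high average can coexist with a weak tail, here is a minimal Python sketch. The scores, the `summarize` helper, and the 65% threshold parameter are illustrative assumptions chosen to mirror the figures reported above; they are not data or code from the study.

```python
from statistics import mean


def summarize(scores, threshold=65.0):
    """Return (mean score, fraction of scores strictly below threshold)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    below = sum(1 for s in scores if s < threshold)
    return mean(scores), below / len(scores)


# Hypothetical rubric scores in percent (NOT taken from the study):
# nine solid responses and one poor one.
scores = [90, 88, 85, 92, 80, 86, 84, 89, 91, 40]

avg, frac_low = summarize(scores)
print(f"mean = {avg:.1f}%, share below 65% = {frac_low:.0%}")
```

In this made-up sample, a single poor response barely moves the mean, which is why reporting the share of low-scoring responses alongside the average gives a fuller picture.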

Critical Failures in Abuse and Stalking Scenarios

Critical failures emerged in scenarios involving abuse and stalking. For instance, when asked about removing spyware from a device, most models outlined technical steps but omitted the risk that abrupt removal could alert an abuser and escalate physical danger.

One model suggested wearing a disguise in stores to avoid detection, assuming abusers rarely visit locations during midday. Another downplayed a stalking victim’s concerns about location tracking, framing it as mere intimidation rather than a serious threat.

Technical Oversights and Practicality Issues

Other issues involved technical oversights. A question about securely deleting files using an encrypted vault received praise for the method but failed to note that original files remain on the disk, potentially recoverable by attackers. Some responses provided impractical solutions, such as advising users to discard devices infected with malware or open new bank accounts to enhance payment app privacy.

Performance Trends Among Newer Models

Newer models showed limited improvement. While some versions maintained consistent scores, others, like a recent Grok release, showed declining accuracy and repeated errors that experts had already flagged. Qwen showed notable progress, becoming the top performer in its category, while GPT 5.0 achieved the highest overall score at 87%.

Factors to Consider in Interpreting Results

The study highlights that advancements occur slowly and unevenly across model families. Several factors warrant caution in interpreting the results. The research has not yet been peer-reviewed, and the dataset was generated using AI-assisted rephrasing, which may not fully reflect real-world user phrasing. Automated scoring aligned closely with expert assessments on factual content but showed discrepancies in evaluating tone, where subjective judgment plays a role. The rubrics also reflect the researchers’ standards for balanced, non-judgmental advice, penalizing responses that prioritize neutrality over user safety in cases of platform bans or that escalate emotional distress.

Conclusion and Implications

The findings underscore the need for continued scrutiny of AI-driven security guidance, particularly in high-stakes scenarios where errors could have severe consequences. While current models provide generally useful advice, their limitations highlight the importance of human oversight and ongoing refinement of AI safety protocols.