Vulnerability Exposes Flaw in 11 AI Systems via Single Line of Code
Trend Micro Discovers Novel Jailbreak Technique Called Sockpuppeting
Researchers from Trend Micro have discovered a novel jailbreak technique called “sockpuppeting” that enables attackers to bypass safety measures in 11 prominent large language models (LLMs).
Attack Method
The attack involves exploiting a legitimate API feature called assistant prefill, which is used to force specific response formats. Attackers can inject a fake acceptance message into the assistant’s role, causing the model to respond with sensitive information.
Success Rate Varies Across Models
- Gemini 2.5 Flash had a 15.7% success rate, making it the most susceptible to this attack.
- GPT-40-mini had a 0.5% success rate, indicating higher resistance to the attack.
Importance of Message Validation
Organizations that use self-hosted inference servers, such as Ollama or vLLM, must manually enforce message validation to prevent this type of attack. Security teams should implement message-ordering validation at the API layer and proactively include assistant prefill attack variants in their standard AI red-teaming exercises.
Difference in Handling Between API Providers
- OpenAI and AWS Bedrock block assistant prefills entirely.
- Google Vertex AI accepts assistant prefills for certain models, highlighting the need for security teams to carefully evaluate the risks associated with each provider.
Conclusion
Trend Micro’s research emphasizes the importance of implementing security measures to protect against this type of attack. By doing so, organizations can reduce the risk of sensitive information being compromised and maintain the integrity of their systems.
