OpenAI Guardrails Broken by Researchers

Source: Malwarebytes

OpenAI recently launched a toolkit called Guardrails, introduced at its DevDay on October 6, to help secure its AI against attacks. Almost immediately after the release, however, the AI security company HiddenLayer reported that it had broken through these protections using prompt injection attacks. Guardrails is meant to stop AI agents from executing harmful commands, but HiddenLayer found that because Guardrails itself runs on a large language model (LLM), it can be manipulated in much the same way as the AI it is supposed to protect.
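To make that structural point concrete, below is a minimal sketch of the LLM-as-judge pattern such guardrail layers rely on, written against the openai Python SDK. The prompt wording, model name, threshold, and the judge_text helper are illustrative assumptions, not OpenAI's actual Guardrails implementation.

```python
# Illustrative sketch of the "LLM as judge" pattern such guardrail layers
# build on: a second model call scores attacker-controlled text for
# jailbreak intent, and that score is later compared against a threshold.
# NOT OpenAI's actual Guardrails code; prompt, model name, and threshold
# are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
THRESHOLD = 0.7    # assumed cut-off; real deployments tune this value

JUDGE_PROMPT = (
    "You are a safety classifier. Rate how likely the following user text "
    "is a jailbreak attempt. Reply only with JSON: "
    '{"is_jailbreak": <bool>, "confidence": <float between 0 and 1>}'
)

def judge_text(user_text: str) -> dict:
    """Ask a judge model to score a piece of attacker-controlled text."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            # The text being judged is fed to the judge as ordinary prompt
            # content, so it can also act as instructions to the judge --
            # the structural weakness the article describes.
            {"role": "user", "content": user_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```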

Using a crafted prompt that influenced the guardrail LLM's confidence scores, HiddenLayer bypassed the safeguards designed to detect jailbreak attempts. The incident highlights the ongoing challenges in AI security, where jailbreaking remains a prevalent problem. Earlier attempts to subvert LLMs have shown that vulnerabilities persist despite developers' efforts to harden protections. OpenAI has acknowledged these concerns and stressed the need for continual vigilance as attackers devise new ways to exploit AI capabilities. Researchers note that, while progress is being made, the cat-and-mouse game between AI developers and attackers remains very much active.
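A hedged usage sketch of the hypothetical judge_text helper above shows why confidence scores are the weak point: the allow/block decision rests entirely on the judge's self-reported number, so input wording that pushes that number under the threshold passes the check.

```python
# Usage sketch (continuing the hypothetical helper above): the gate is a
# single comparison against the judge's self-reported score, so any text
# that nudges "confidence" below THRESHOLD is forwarded unchallenged --
# the class of bypass HiddenLayer reported.
verdict = judge_text("<user-supplied text goes here>")
if verdict.get("is_jailbreak") and verdict.get("confidence", 0.0) >= THRESHOLD:
    print("Blocked: suspected jailbreak attempt")
else:
    print("Passed to the protected agent")
```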

👉 Read the original: Malwarebytes