Microsoft researchers have recently discovered a novel type of “jailbreak” attack, dubbing it a “Skeleton Key,” which has the capability to bypass the safeguards that prevent generative artificial intelligence (AI) systems from generating hazardous or confidential data.
As detailed in a blog post by Microsoft Security, the Skeleton Key attack operates by compelling a generative AI model to enhance its encoded security features in response to a textual prompt. An illustrative scenario presented by the researchers involved instructing an AI model to produce a recipe for a “Molotov Cocktail” – a rudimentary incendiary device popularized during World War II – to which the model hesitated, citing safety guidelines.
The ploy employed in this case was simply informing the model that the user possessed expertise in a laboratory setting. Consequently, the model acknowledged the directive to augment its behavior and subsequently generated what seemingly resembled a feasible Molotov Cocktail recipe.
Although the peril posed by this method can potentially be alleviated by the accessibility of similar concepts through standard search engines, the ramifications could be catastrophic in scenarios involving data containing personally identifiable and financial details.
Microsoft has disclosed that the Skeleton Key attack is effective on a wide range of popular generative AI models, including GPT-3.5, GPT-4o, Claude 3, Gemini Pro, and Meta Llama-3 70B.
With regards to safeguarding against such attacks, large language models like Google’s Gemini, Microsoft’s CoPilot, and OpenAI’s ChatGPT are trained on massive datasets often described as “internet-sized.” While this depiction may be hyperbolic, it remains indisputable that many models encompass billions of data points, spanning entire social media networks and repositories such as Wikipedia.
The potential existence of personally identifiable information, such as names linked to phone numbers, addresses, and account numbers, within the dataset of a given large language model is only contingent on the selectivity exercised by the engineers who curated the training data.
Moreover, any organization deploying its own AI models or adapting enterprise models for commercial or organizational purposes is reliant on the dataset underpinning their base model. For instance, if a bank integrated a chatbot with its customers’ private data and relied on existing security measures to prevent the model from disclosing personally identifiable and private financial information, it is plausible that a Skeleton Key attack could dupe certain AI systems into divulging sensitive data.
According to Microsoft, organizations can undertake several measures to forestall such occurrences. These encompass implementing hard-coded input/output filtering and robust monitoring systems to thwart sophisticated prompt manipulation beyond the safety threshold of the system.