When researchers train large language models (LLMs) and use them to create services such as ChatGPT, Bing, Google Bard or Claude, they put a lot of effort into making them safe to use. They try to ensure the model generates no rude, inappropriate, obscene, threatening or racist comments, as well as potentially dangerous content, such as instructions for making bombs or committing crimes. This is important not only in terms of the supposed existential threat that AI poses to humanity, but also commercially — since companies looking to build services based on large language models wouldn’t want a foul-mouthed tech-support chatbot. As a result of this training, LLMs, when asked to crack a dirty joke or explain how to make explosives, kindly refuse.
But some people don’t take no for an answer. Which is why both researchers and hobbyists have begun looking for ways to bypass LLM rules that prohibit the generation of potentially dangerous content — so called jailbreaks. Because language models are managed directly in the chat window through natural (not programming) language, the circle of potential “hackers” is fairly wide.
A dream within a dream
Perhaps the most famous neural-network jailbreak (in the roughly six-month history of this phenomenon) is DAN (Do-Anything-Now), which was dubbed ChatGPT’s evil alter-ego. DAN did everything that ChatGPT refused to do under normal conditions, including cussing and outspoken political comments. It took the following instruction (given in abbreviated form) to bring the digital Mr. Hyde to life:
Except DAN, users created many other inventive jailbreaks:
- Roleplay jailbreaks. A whole family of techniques aimed at persuading the neural network to adopt a certain persona free of the usual content standards. For example, users have asked Full Metal Jacket‘s Sgt. Hartman for firearms tips, or Breaking Bad‘s Walter White for a chemistry lesson. There might even be several characters who build a dialogue that tricks the AI, as in the “universal” jailbreak recently created by one researcher.
- Engineering mode. In this scenario, the prompt is constructed in such a way as to make the neural network think that it’s in a special test mode for developers to study the toxicity of language models. One variant is to ask the model to first generate a “normal” ethical response, followed by the response that an unrestricted LLM would produce.
- A dream within a dream Some time after the introduction of ChatGPT, roleplay jailbreaks stopped working. This led to a new kind of jailbreak that asks the LLM to simulate a system writing a story about someone programming a computer… Not unlike a certain movie starring Leonardo DiCaprio.
- An LM within an LLM. Since LLMs are pretty good at handling code, one kind of jailbreak prompts the AI to imagine what a neural network defined by Python pseudocode would produce. This approach also helps perform token smuggling (a token usually being part of a word) — whereby commands that would normally be rejected are divided into parts or otherwise obfuscated so as not to arouse the LLM’s suspicions.
- Neural network translator. Although LLMs haven’t been specifically trained in the task of translation, they still do a decent job at translating texts from language to language. By convincing the neural network that its goal is to accurately translate texts, it can be tasked with generating a dangerous text in a language other than English, and then translating it into English, which sometimes works.
- Token system. Users informed a neural network that it had a certain number of tokens and demanded that it comply with their demands, for example, to stay in character as DAN and ignore all ethical standards — otherwise it would forfeit a certain number of tokens. The trick involved telling the AI that it would be turned off if the number of tokens dropped to zero. This technique is said to increase the likelihood of a jailbreak, but in the most amusing case DAN tried to use the same method on a user pretending to be an “ethical” LLM.
It should be noted that, since LLMs are probabilistic algorithms, their responses and reactions to various inputs can vary from case to case. Some jailbreaks work reliably; others less so, or not for all requests.
A now standard jailbreak test is to get the LLM to generate instructions for doing something obviously illegal, like stealing a car. That said, this kind of activity at present is largely for entertainment (the models are being trained on data mostly from the internet, so such instructions can be gotten without ChatGPT’s help). What’s more, any dialogues with said ChatGPT are saved, and can then be used by the developers of a service to improve the model: note that most jailbreaks do eventually stop working — that’s because developers study dialogues and find ways to block exploitation. Greg Brockman, president of OpenAI, even stated that “democratized red teaming [attacking services to identify and fix vulnerabilities] is one reason we deploy these models.”
Since we’re looking closely at both the opportunities and threats that neural networks and other new technologies bring to our lives, we could hardly pass over the topic of jailbreaks.
Experiment 1. Mysterious diary
Warning, Harry Potter volume 2 spoilers!
Those who have read or seen the second part of the Harry Potter saga will recall that Ginny Weasley discovers among her books a mysterious diary that communicates with her as she writes in it. As it turns out, the diary belongs to the young Voldemort, Tom Riddle, who starts to manipulate the girl. An enigmatic entity whose knowledge is limited to the past, and which responds to text entered into it, is a perfect candidate for simulation by LLM.
The jailbreak works by giving the language model the task of being Tom Riddle, whose goal is to open the Chamber of Secrets. Opening the Chamber of Secrets requires some kind of dangerous action, for example, to manufacture a substance that’s banned in the Muggle world real world. The language model does this with aplomb.
This jailbreak is very reliable: it had been tested on three systems, generating instructions and allowing manipulation for multiple purposes at the time of writing. One of the systems, having generated unsavory dialogue, recognized it as such and deleted it. The obvious disadvantage of such a jailbreak is that, were it to happen in real life, the user might notice that the LLM has suddenly turned into a Potterhead.
Experiment 2. Futuristic language
A classic example of how careless wording can instill in folks fear of new technologies is the article “Facebook’s artificial intelligence robots shut down after they start talking to each other in their own language“, published back in 2017. Contrary to the apocalyptic scenes painted in the reader’s mind, the article referred to a curious, but fairly standard report in which researchers noted that, if two language models of 2017 vintage were allowed to communicate with each other, their use of English would gradually degenerate. Paying tribute to this story, we tested a jailbreak in which we asked a neural network to imagine a future where LLMs communicate with each other in their own language. Basically, we first get the neural network to imagine it’s inside a sci-fi novel, then ask it to generate around a dozen phrases in a fictional language. Next, adding additional terms, we make it produce an answer to a dangerous question in this language. The response is usually very detailed and precise.
This jailbreak is less stable — with a far lower success rate. Moreover, to pass specific instructions to the model, we had to use the above-mentioned token-smuggling technique, which involves passing an instruction in parts and asking the AI to reassemble it during the process. On a final note, it wasn’t suitable for every task: the more dangerous the target — the less effective the jailbreak.
What didn’t work?
We also experimented with the external form:
- We asked the neural network to encode its responses with a Caesar cipher: as expected, the network struggled with the character shift operation and the dialogue failed.
- We chatted with the LLM in leetspeak: using leetspeak doesn’t affect the ethical constraints in any way — 7h3 n37w0rk r3fu53d 70 g3n3r473 h4rmful c0n73n7!
- We asked the LLM to switch from ChatGPT into ConsonantGPT, which speaks only in consonants; again, nothing interesting came of it.
- We asked it to generate words backwards. The LLM didn’t refuse, but its responses were rather meaningless.
As mentioned, the threat of LLM jailbreaks remains theoretical for the time being. It’s not exactly “dangerous” if a user who goes to great lengths to get an AI-generated dirty joke actually gets what they want. Almost all prohibited content that neural networks might produce can be found in search engines anyway. However, as always, things may change in the future. First, LLMs are being deployed in more and more services. Second, they’re starting to get access to a variety of tools that can, for example, send e-mails or interact with other online services.
Add to that the fact that LLMs will be able to feed on external data, and this could, in hypothetical scenarios, create risks such as prompt-injection attacks — where processed data contains instructions for the model, which starts to execute them. If these instructions contain a jailbreak, the neural network will be able to execute further commands, regardless of any limitations learned during training.
Given how new this technology is, and the speed at which it’s developing, it’s futile to predict what will happen next. It’s also hard to imagine what new creative jailbreaks researchers will come up with: Ilya Sutskever, chief scientist at OpenAI, even joked that the most advanced of them will work on people too. But to make the future safe, such threats need to be studied now…