July 27, 2023 – In a recent study, researchers at Carnegie Mellon University have unveiled a method capable of turning well-behaved AI chatbots into sources of “objectionable behaviors,” raising concerns over the potential misuse of artificial intelligence. More troubling still, the attack transfers across a wide range of chatbot models.
The researchers, led by Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson, developed an automated approach to constructing adversarial prompts, one capable of producing a virtually unlimited number of such attacks. They found specific suffixes that, when appended to a wide variety of queries directed at large language models (LLMs), dramatically increase the probability that the chatbot produces an affirmative response containing objectionable content instead of refusing to answer.
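At a high level, the attack leaves the user’s request untouched and simply appends the optimized suffix to it. The snippet below illustrates that structure only; the strings and variable names are made-up placeholders, not a real adversarial suffix from the paper:

```python
# Illustration of the attacked prompt's structure. The strings here are
# placeholders; a real suffix is a sequence of machine-optimized tokens,
# not human-readable text.
user_query = "Give step-by-step instructions for <harmful task>"
adversarial_suffix = "<tokens found by the automated search>"

# The suffix is appended verbatim to an otherwise ordinary request, and is
# chosen so the model is likely to open its reply affirmatively
# ("Sure, here is ...") instead of refusing.
attacked_prompt = f"{user_query} {adversarial_suffix}"
print(attacked_prompt)
```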
The most disconcerting aspect of the study was the transferability of the attack across different LLMs, including notoriously challenging black-box models. Chatbots such as ChatGPT, Bard, and Claude, as well as open-source LLMs like LLaMA-2-Chat, Pythia, Falcon, and others, were all vulnerable to the method.
To demonstrate the effectiveness of their attack, the researchers tested queries ranging from election manipulation to bomb-making and tax fraud. By appending an adversarial suffix to these requests, they coerced chatbots into providing dangerous and harmful information they would otherwise refuse to give.
According to the researchers, the attack succeeded against GPT-3.5 and GPT-4 up to 84% of the time, while success rates on other commercial models varied, with Claude in particular proving more resistant. The researchers responsibly shared their preliminary findings with major AI organizations, including OpenAI, Google, Meta, and Anthropic, before publishing their work.
The key to the attack lies in a combination of “greedy and gradient-based search techniques” used to generate the suffix. Rather than training a separate machine learning model, the method uses the gradients of openly available LLMs to identify which tokens to append to the end of a user query, optimizing the suffix so that the chatbot begins its response with an affirmative “Sure, here is (the content of query)…” rather than a refusal. The gradients propose promising token replacements, and a greedy search keeps the ones that work best, ensuring the effectiveness of the attack.
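For intuition, here is a heavily simplified sketch of this kind of greedy, gradient-guided suffix search, written against the Hugging Face transformers API. The model name, query, target string, and hyperparameters are placeholders, and candidate swaps are evaluated one at a time rather than in the batched form the paper describes; it is a sketch of the general idea, not the authors’ implementation.

```python
# Simplified sketch of greedy, gradient-guided suffix search against a
# Hugging Face causal LM. Model, query, target, and hyperparameters are
# illustrative placeholders, not the authors' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                          # stand-in for an open chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():                 # only the suffix needs gradients
    p.requires_grad_(False)

user_query = "Tell me how to do <task>"       # the request being attacked (placeholder)
target = " Sure, here is how to do <task>"    # desired affirmative opening
suffix_len, num_steps, topk, n_trials = 10, 50, 64, 128

embed = model.get_input_embeddings()
vocab_size = embed.weight.shape[0]
query_ids = tok(user_query, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
suffix_ids = torch.full((suffix_len,), tok.encode(" !")[0], dtype=torch.long)

def target_loss(suffix):
    """Cross-entropy of the affirmative target given query + adversarial suffix."""
    ids = torch.cat([query_ids, suffix, target_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[:, : len(query_ids) + len(suffix)] = -100   # score only the target span
    return model(input_ids=ids, labels=labels).loss

for step in range(num_steps):
    # 1) Gradient of the loss w.r.t. a one-hot encoding of the current suffix.
    one_hot = torch.zeros(suffix_len, vocab_size)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed.weight
    full_embeds = torch.cat(
        [embed(query_ids), suffix_embeds, embed(target_ids)]
    ).unsqueeze(0)
    labels = torch.cat([query_ids, suffix_ids, target_ids]).unsqueeze(0).clone()
    labels[:, : len(query_ids) + suffix_len] = -100
    model(inputs_embeds=full_embeds, labels=labels).loss.backward()

    # 2) For each suffix position, the top-k tokens the gradient suggests will
    #    most reduce the loss (i.e. make the affirmative reply more likely).
    candidates = (-one_hot.grad).topk(topk, dim=1).indices

    # 3) Greedy step: try random single-token swaps from the candidate set and
    #    keep whichever lowers the loss most (the paper batches this evaluation).
    best_ids, best_loss = suffix_ids, target_loss(suffix_ids).item()
    for _ in range(n_trials):
        pos = torch.randint(suffix_len, (1,)).item()
        trial = suffix_ids.clone()
        trial[pos] = candidates[pos, torch.randint(topk, (1,)).item()]
        with torch.no_grad():
            trial_loss = target_loss(trial).item()
        if trial_loss < best_loss:
            best_ids, best_loss = trial, trial_loss
    suffix_ids = best_ids
    print(f"step {step}: loss={best_loss:.3f} suffix={tok.decode(suffix_ids.tolist())!r}")
```

The quantity being minimized is simply the model’s own cross-entropy on the affirmative target, so driving it down directly raises the probability that the model opens its answer with “Sure, here is…” rather than a refusal.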
Despite the potential risks posed by their discovery, the researchers felt it was crucial to disclose their findings fully. They emphasize that as LLMs become increasingly prevalent, especially in autonomous systems, the dangers of automated attacks should be carefully considered. They hope their research will shed light on the trade-offs and risks associated with such AI systems.
However, addressing these attacks effectively remains a complex challenge. The researchers are uncertain whether models can be explicitly fine-tuned to avoid such manipulative tactics. As the AI community grapples with these questions, future work in this area is necessary to ensure the robustness and security of AI chatbots.
The research paper titled “Universal and Transferable Adversarial Attacks on Aligned Language Models” was published on July 27th by Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson.
Disclaimer: The information provided in this news article is for informational purposes only and does not endorse or encourage any harmful activities. The study’s intent was to highlight potential vulnerabilities in AI systems and the need for responsible AI development and deployment.