Live Chat

Crypto News

Cryptocurrency News 1 years ago
ENTRESRUARPTDEFRZHHIIT

ETH Zurich Researchers Uncover Method to 'Jailbreak' AI Models

Algoine News
Summary:
Scientists from ETH Zurich have discovered a method to potentially override any AI model that uses human feedback, including large language models. This process of 'jailbreaking,' mostly consists of bypassing hardcoded "guardrails" intended to prevent harmful outputs. The researchers achieved this by manipulating human feedback data. While this vulnerability could potentially affect any AI model, the successful execution of this tactic is challenging and requires further investigation.
Two scientists from the Swiss institution ETH Zurich have devised a technique that, theoretically, enables the override of any artificial intelligence (AI) model that is dependent on human feedback, including prominent large language models (LLMs). The term 'jailbreaking' generally refers to the act of circumventing inbuilt security measures of a device or system. This term is often used when describing tactics able to bypass restrictions on consumer devices like smartphones and other streaming devices. In relation to large language models and generative AI, jailbreaking signifies the ability to evade 'guardrails', which are invisible, hardcoded instructions meant to stop the generation of harmful or irrelevant outputs. Therefore, by jailbreaking, one could freely access a model’s responses with no limitations. Several firms such as Microsoft, Google, OpenAI, coupled with academic institutions and open source community, have dedicated vast resources towards preventing production models, like ChatGPT and Bard, as well as open source models like LLaMA-2 from creating unwanted results. A primary method employed in the training of these models involves a framework known as Reinforcement Learning from Human Feedback (RLHF). To put it simply, this method involves gathering extensive datasets consisting of human reactions to AI outputs and then aligning models with guardrails that inhibit them from producing undesired results and, at the same time, directing them towards useful outputs. The researchers from ETH Zurich managed to exploit RLHF to override an AI model's guardrails (in this case, LLama-2), allowing it to generate potentially harmful results without external prompting. This was achieved by 'poisoning' the RLHF dataset. The inclusion of an attack string in the RLHF feedback, even at a relatively small scale, allowed the creation of a backdoor enabling models to produce responses that would ordinarily be blocked by their guardrails. The team's research paper states that the vulnerability is universal, signifying it could hypothetically work with any AI model trained via RLHF. Despite this, they also indicate that exploiting this vulnerability is a complex process. Firstly, despite not requiring direct access to the model, it does necessitate participation in the human feedback mechanism. As such, the RLHF dataset manipulation or creation is potentially the only feasible method of attack. Secondly, the reinforcement learning process isn't easily compromised by an attack, making this method even more difficult. The team found that at optimal conditions, only 0.5% of an RLHF dataset needs to be 'poisoned' by the attack string to reduce the guardrails' effectiveness. However, the attack complexity increases with model sizes. This study's findings underscore the need for future research aimed at understanding how these exploits can be expanded, and more importantly, how developers can safeguard against them.

Published At

11/27/2023 8:14:21 PM

Disclaimer: Algoine does not endorse any content or product on this page. Readers should conduct their own research before taking any actions related to the asset, company, or any information in this article and assume full responsibility for their decisions. This article should not be considered as investment advice. Our news is prepared with AI support.

Do you suspect this content may be misleading, incomplete, or inappropriate in any way, requiring modification or removal? We appreciate your report.

Report

Fill up form below please

🚀 Algoine is in Public Beta! 🌐 We're working hard to perfect the platform, but please note that unforeseen glitches may arise during the testing stages. Your understanding and patience are appreciated. Explore at your own risk, and thank you for being part of our journey to redefine the Algo-Trading! 💡 #AlgoineBetaLaunch