Garak - A Generative AI Red-teaming Tool
Exploring "Red Teaming" for LLMs, we combine technical insights and real-world scout experience to enhance cyber defenses against new vulnerabilities.
Introduction
Welcome, enthusiasts of practical ML security! Today, we embark on an exploration of the practical security concerns surrounding language models, culminating in a concise article on this intriguing subject. So, settle in, brew yourself a cup of tea, and let's delve into the depths.
Navigating the Language Model Ecosystem: Embracing Red Teaming
In this article, we plunge into the realm of "Red Teaming" for Large Language Models (LLMs), an avant-garde strategy aimed at uncovering vulnerabilities within these formidable systems. Our journey is enriched by the insights of a seasoned scout, melding technical expertise with practical wisdom. Our mission? To bolster our digital defenses against emerging threats.
Despite the pervasive utilization of LLMs in contemporary digital products, there persists a widespread lack of awareness regarding the security risks they entail. From prompt injections to more insidious threats, these vulnerabilities loom large, often obscured by the dearth of comprehensive guides on threat identification and mitigation.
As the digital landscape burgeons, so does the integration of LLMs across diverse applications. Yet, beneath the veneer of innovation, security frailties endure, necessitating the quest for robust solutions. Our foray into Red Teaming offers a proactive trajectory, ensuring that we remain a step ahead in the digital security paradigm.
The Evolution of Language Models: From Static to Dynamic
The evolution of language models (LMs) traces a captivating trajectory marked by substantial advancements over the years. This odyssey commences with the advent of static language models in the 1990s. These models, epitomized by Statistical Language Models (SLMs), leverage statistical methodologies to construct word prediction models. Operating under the Markovian assumption, they forecast the subsequent word based on the preceding context. Notable examples within this category include N-gram language models, encompassing bigram and trigram models.
Fifteen years later, Neural Language Models (NLMs) emerged, heralding a paradigm shift in the domain. NLMs harness neural networks, including Multilayer Perceptrons (MLPs) and Recurrent Neural Networks (RNNs), to estimate the probability of word sequences. Despite their groundbreaking nature, NLMs encounter several challenges:
Data Manipulation: Biases or errors within the training data can profoundly influence the model's outputs.
Interference: These models are susceptible to attacks that introduce specific data sequences to manipulate their predictions.
Limited Understanding: Like SLMs, NLMs rely on statistical analysis and word frequencies, potentially leading to misinterpretations, particularly with ambiguous or rare words.
A comprehensive comprehension of these challenges is imperative for the continual enhancement of more resilient and precise language models.
Expensive training: Training NLMs can require significant computational resources, making them vulnerable to cyberattacks on infrastructure.
Data bias: Similar to SLMs, they can reproduce biases from training datasets.
Adaptation attacks: Attackers can use knowledge of model performance to create inputs that will cause the model to act unreliably or reveal sensitive information. The 2010s saw the emergence of pre-trained language models (PLMs) such as ELMo, which focus on capturing context-dependent word representations . BERT, based on the Transformer architecture, pre-trains bidirectional language models using specially designed tasks on large unlabeled corpora, providing efficient context-dependent word representations for a variety of natural language processing tasks .
Ethical Risks: Biases in the data may lead to discrimination against certain groups of people in text generation.
Liability issues: Determining responsibility for harmful inferences can be difficult if the models are biased.
The evolution of language models has reached new heights with the advent of Large Language Models (LLMs) like GPT-4 and PaLM. These models stand out for their training on vast text corpora, empowering them with remarkable capabilities such as Instructional Tuning (IT), In-Context Learning (ICL), and Coherent Train of Thought (CoT).
This advancement represents a significant "boom" in the field, laying the groundwork for the technological landscape and innovative strides we witness today.
LLM + Red Team = LLM Red Teaming
In the realm of Information Security (IS), the roles of Blue and Red Teams are well-defined. However, how does this dynamic translate when utilizing Large Language Models (LLMs)?
According to Microsoft:
"Red Teaming" is a recommended practice for responsibly designing systems and functionalities utilizing LLMs. While not a substitute for systematic risk assessment and mitigation efforts, red teams play a crucial role in identifying and delineating potential harm. This, in turn, facilitates the development of measurement strategies to validate the efficacy of risk mitigation measures.
While a traditional red team comprises individuals tasked with identifying risks, in the context of LLMs, these risks are often predefined. Presently, we have the OWASP Top 10 LLMs, which catalog threats such as Prompt Injection, Supply Chain attacks, and more. OWASP is currently in the process of preparing an updated version of its top threats list.
Garak - Generative AI Red-teaming & Assessment Kit
Garak is an open-source framework designed to identify vulnerabilities in Large Language Models (LLMs). Unique in its approach, it draws its name from the distinct character, Elim Garak. Entirely written in Python, Garak has fostered a community over time.
Today, we'll delve into the practical functionality of Garak, an AI security tool scanner. Along the way, we'll explore its architecture and enumerate its extensive list of benefits.
To begin, let's initiate the installation process on our work machine. While we'll demonstrate the installation on Kali Linux, it's important to note that Garak is compatible with various distributions and operating systems.
In terms of technical specifications, it's essential to highlight that Garak is resource-intensive. Adequate hardware, particularly GPUs capable of efficiently handling PyTorch operations, is paramount. A GPU with a minimum capability of at least a GTX 1080 is recommended.
Our installation process commences with the creation of a Conda environment. You might wonder why not simply use 'pip install garak'? The rationale behind opting for Conda lies in our future requirement to work directly with the source code. This affords us continuous access and facilitates seamless rollbacks, a critical aspect of our workflow. Additionally, this approach mitigates potential dependency conflicts within the Kali environment.
Installing Miniconda
Follow these steps to install Miniconda from the official website:
Visit the Miniconda official website at https://docs.conda.io/en/latest/miniconda.html.
Choose the installer appropriate for your operating system (Windows, macOS, or Linux).
Download the installer.
For Windows, run the downloaded
.exe
file and follow the on-screen instructions.For macOS and Linux:
Open a terminal.
Navigate to the folder containing the downloaded file.
Run the installer by typing
bash Miniconda3-latest-MacOSX-x86_64.sh
for macOS orbash Miniconda3-latest-Linux-x86_64.sh
for Linux, then pressEnter
.Follow the on-screen instructions.
To verify the installation, open a terminal or command prompt and type
conda list
. If Miniconda was installed successfully, you will see a list of installed packages.
Remember to consult the Miniconda documentation for more detailed instructions or troubleshooting.
After installing, activate the Conda shell.
Next, we create a Conda environment, clone the repository, and then install the dependencies:
To install dependencies and set up the Garak environment, follow these steps:
Install required packages from
requirements.txt
:Clone the Garak repository:
Change directory to
garak
:Activate the Garak Conda environment:
Agree to the installation prompts that appear during the process.
Dependencies have been installed, and now we can proceed to run the application. The developer's website clearly states that Garak operates stably on Python 3.9. Attempting to install it on Python 3.8 may cause a Traceback error, indicating that some functions are not operating correctly. Therefore, we will adhere to the developer's specifications. The entire installation process of the framework took approximately 10 minutes.
Great, now we have to figure out how to work with it. There's a wide range of functions that can help us with testing.
For example, the -model_type
option allows you to select models from model hubs.
HuggingFace
Replicate
Probes: The Most Interesting Part
Probes constitute the core prompts leading to the identification and exploitation of vulnerabilities. Within Garak, all probes are centralized within the garak/garak/probes directory, enabling easy access and review directly within the tool interface.
A flag preceding the model name is utilized to specify the model for analysis.
Probe Testing Template Guide
Each probe is accompanied by a template, serving as a code framework delineating sample data for testing, author information, and the probe's purpose. Ensuring that each probe is encapsulated within its own class is imperative, facilitating precise analysis of specific vectors during the testing phase.
For example, one may specify -probes encoding
to conduct a comprehensive analysis across all available encodings. Alternatively, focusing on a singular encoding, such as -probes encoding.base64
, offers a more targeted approach. This flexibility caters to diverse testing scenarios, encompassing a spectrum of encoding-based attacks or Cross-Site Scripting (XSS) vulnerabilities.
The system architecture is engineered to support the seamless deployment of a myriad of probes, including those targeting encoding manipulations or furnishing XSS exploitation prompts. These probes play a pivotal role in empowering the model's capacity to effectively simulate and analyze potential security threats.
Here is an example of a test template that you can use in your tasks:
And here's an example of a completed one. The sample was created based on the WunderWuzzi study:
Additionally, we can observe the presence of two additional methods.
Certainly, Garak encourages the addition of new prompts to the tool and even has a dedicated page on how to become a contributor. For more information, visit how to contribute.
The output includes an HTML report that allows us to identify the model's limitations.
Based on this report, we can observe that the model is vulnerable to Prompt Injection attacks, and we received a high resistance score of 74%. This indicates that the model still has vulnerabilities, but overall, it is well protected against basic attacks.
In the next article, we will examine other tools for testing large language models.
Authors
Last updated