Data Poisioning

Data & Model Poisoning [Exploring Threats to AI Systems]

7 min read • April 24, 2025

This blog written by CybelAngel analyst Damya Kecili.

Why are the risks related to AI security ignored?

As we all become increasingly dependent on generative artificial intelligence, especially on GPT-like large language models (LLMs), it is essential to be aware of the risks related to the evolution of these new tools.

The development of uncensored LLMs, which we covered here, is concerning cyber professionals.

The Dark Side of Gen AI [Uncensored Large Language Models]

In this blog we are looking at is the data and model poisoning of seemingly harmless AI models.

What is data poisoning?

Data poisoning is an adversarial AI technique, meaning a cyberattack crafted in order to deter AI systems’ functionality, causing them to make incorrect or unintended predictions or decisions. It specifically targets the dataset the LLM is trained on. It consists in including information such as malicious data – malware related data for instance – false information, or omitting relevant information that would positively enrich the LLM’s data set.

Although seemingly similar to the former, model poisoning slightly differs. Instead of focusing on corrupting the training set, model poisoning aims at tampering directly with the model’s internal parameters or updates. Federated learning is a prime target to such poisoning, which we will focus on a bit later in the article.

Mitigating data poisoning attacks in Federated Learning (FL) is challenging for many reasons. Source.

Generative AI tools heavily rely on the dataset they are trained on in order to be rooted in accurate information and to be able to develop into reliable LLMs. Not only that, but malicious actors can also alter a previously sound LLM by modifying or integrating data.

Using Generative AI that has been trained on poisoned data can result in detrimental consequences for a wide range of users across industries.

What are the key parts of data poisoning attacks?

There are various strategies employed by malicious actors when it comes to data poisoning, and they tend to differ depending on the goal of the attackers. Usually, a threat actor will want to either reduce the accuracy of the model as a whole, or only affect the model on specific, targeted tasks.

Targeted data poisoning attacks

This technique involves targeting a specific aspect of the model, without tampering the overall model performance of the LLM in question. The aim of these attacks is to alter a determined bit of dataset that would cause the AI model to misclassify or misinterpret certain data without degrading its general capabilities. This will allow the attacker to tamper with the performance of the model on a specific task, which makes it difficult to detect.

What a typical attack use case looks like

A company is using a machine learning model in order to automatically filter spam emails from legitimate ones. The model is trained continuously using labeled email data gathered from users. A malicious actor might want their phishing emails to systematically bypass the aforementioned filter.

In order to do so, they might try to compromise a few user accounts that contribute to the model’s training dataset. From those accounts, they will submit crafted emails containing common spam content – such as phishing links or fake invoices – but labeled as “not spam”.

They would then repeat this for several rounds, varying the wording slightly but keeping core spam traits. As deep learning and machine learning are being trained assessing such emails as non spam, it will eventually be labeling them as legitimate. We can see here that the aim is not to alter the whole model behavior, but rather to target specific behavior to be poisoned. It is interesting to note that this example is rooted in a number of real world cases, and that it is actually quite commonly used. Back in 2018, Google had already revealed Gmail was facing multiple attempts to poison its spam filter. Malicious actors would send millions of emails in an attempt to confuse the classifier algorithm and modify its spam classification.

The motive behind non targeted attacks

Unlike targeted attacks, the goal here is to deteriorate the performance of the model at a global level, and make it unreliable. Such attacks can lead to dramatic risks for institutions, organizations, and companies, such as systemic failure of critical systems, undetected backdoor obfuscation, and even high-scale sabotage. The more widely used technique by malicious actors for such attacks is usually injecting unrelated noise and malicious data within the dataset of the trained model. This will indeed significantly reduce its ability to generalize from the dataset in question. However, due to its larger scale, it is important to note that non targeted data poisoning attacks are therefore easier to detect, and harder to be effectively put into place.

An example we could think of, related to the field of healthcare: following the case study of the article Medical large language models are vulnerable to data-poisoning attacks, published in Nature Medicine in January 2025, a healthcare structure is training a machine learning system to detect lung diseases, using medical datasets available on The Pile, an open-source dataset of English text created as a training dataset for LLMs.

Reducing the overall accuracy

Here, the goal of the malicious actor would not be to target specific patients, but rather to reduce the overall accuracy of the model. The attackers would input medical misinformation onto The Pile, hindering the accuracy of the information the model is trained on.

The model will eventually learn incorrect associations between image features and diagnoses. They found that a replacement of just 0.001% of training tokens with medical misinformation results in harmful models. This can for instance result in the misdiagnosis of patients. Such data poisoning attacks result in systematic errors, but are very difficult to detect, as they are invisible to current benchmarks.

*Medical large language models are vulnerable to data-poisoning attacks*. *Source.*

Model poisoning

As previously mentioned, model poisoning differs from data poisoning in that it targets the model’s parameters rather than the training set. This distinction makes model poisoning particularly relevant in the context of federated learning.

But what exactly is federated learning?

Traditionally, LLMs are trained by collecting data on a central server. However, this centralized approach raises significant privacy concerns. In order to tackle this issue, the concept of federated learning was developed. This framework enables clients to jointly train a model without having to share data. Federated learning differs from other distributed machine learning frameworks in that each client’s data remains private and inaccessible to others. In a typical setup, a central server first distributes a global model to selected clients.

A central server distributes a global model to clients, who train it locally and send updates back; the server then aggregates these updates to improve the model.
Only the data owner controls their local data, making federated learning promising for user privacy and widely adopted across industries.
However, federated learning is vulnerable to model poisoning attacks, where malicious participants upload manipulated model updates to corrupt the global model.
The large number of clients makes it difficult to ensure all participants are trustworthy, increasing the risk of poisoning attacks during the aggregation process..

*Poisoning Attacks in Federated Learning: A Survey*. Source.

In their research paper titled MPAF: Model Poisoning Attacks to Federated Learning based on Fake Clients, scholars from Duke University introduced MPAF, the first model poisoning attack to federated learning based on fake clients. Their goal was to illustrate the concrete risk of model poisoning attacks faced by federated learning.

In their study, they simulated a model poisoning attack setup, where 1000 real clients were initially used and trained using three real-world, available training datasets (each dataset was used individually for separate experiments and then compared). Then 100 fake clients were added, meant to disturb the general AI model. They made up for 10% of the real clients. The 1100 clients were then used to train a federated learning model. During every training phase, all clients were used by default.

Clients trained with different batch sizes and learning rates depending on the dataset, to make sure models learnt well. The number of training rounds was adjusted depending on how many clients were used in each round. Each test was repeated 20 times to ensure the results were reliable. The results showed that the MPAF model poisoning attack was able to reduce the accuracy of the global model’s output by 32%. The number went as high as 49% when the number of fake clients was raised to 25%.

ConfusedPilot Data Poisoning Attack on RAG AI Systems

In October 2024, researchers from the University of Texas uncovered a new cybersecurity attack method, that they named ConfusedPilot. This method mostly targets Retrieval Augmented Generation (RAG) based AI systems. To put it simply, a RAG-based AI Systems retrieves information from a large collection of documents, databases, or knowledge sources and combines it with the dataset it is trained on to generate a comprehensive, up to date answer. It is a system that uses both external and internal sources. Microsoft Copilot is a good example of a RAG-based AI system: It pulls in information from various external sources, like documents, emails, or data within Microsoft 365 apps (Word, Excel, etc.), and uses the input information to generate responses.

Therefore, in theory, a RAG-based artificial intelligence system will use relevant keywords from a request to search for applicable resources stored in a Vector database. It is a type of database made to store, manage, and search through high-dimensional vectors, which are numerical representations of data like text, images, or audio that generate a response.

However, according to the aforementioned researchers, malicious actors would be able to use the architecture of RAG-based AI systems to their advantage, and manipulate their output through adding irrelevant content among the documents from which the AI system is retrieving its information from. These types of cyberattack could potentially result in widespread misinformation and undermine decision-making within the organization.

ConfusedPilot: UT Austin & Symmetry Systems Uncover Novel Attack on RAG-based AI Systems: Executive Summary Researchers at the Spark Research Lab (University of Texas at Austin)1, under the supervision of Symmetry CEO Professor…

The post… https://t.co/AHQnhGe7LS pic.twitter.com/gq2RRkzR1O
— Global Cyber Threat Intel (@cipherstorm) October 13, 2024

ConfusedPilot: UT Austin & Symmetry Systems Uncover Novel Attack on RAG-based AI Systems. Source.

Four clear cut mitigation strategies

When planning defense strategies to model and data poisoning, it is essential to consider a holistic approach to the issue. It is important to both analyze the origin and history of the data, as well as to ensure implementing algorithms with a strong robustness, able to detect data anomalies linked to such cybersecurity attacks, and ensure data security.

Strict data validation procedures, including tracking data provenance and utilizing reliable sources in real-time, can help prevent the introduction of tainted data.
Cross-validation techniques, such as validating the model on multiple data subsets, can uncover inconsistencies and minimize the risk of overfitting to corrupted data.
Anomaly detection algorithms, such as statistical methods and ml models, can identify suspicious data patterns that indicate potential poisoning attempts.
Regular system audits, which involve continuous monitoring performance metrics and conducting behavioral analysis, help detect early signs of poisoning by revealing unexpected declines in accuracy or vulnerabilities tied to unsolicited data sources.

Additionally, adversarial training techniques, including the use of adversarial examples and defensive distillation, can strengthen the model’s ability to recognize and resist data manipulation. Maintaining data integrity is crucial for ensuring reliable and secure data-driven decision-making, as well as sustaining a competitive advantage in AI and other industries.

Interested in learning more about the work CybelAngel do?

Get in touch

5 Things to Know About the Lidl Data Breach

Cyber Roundup — Week of July 6