Why we’re teaching LLMs to forget things
A new field called large language model unlearning is centered on removing the influence of unwanted data on a trained LLM so that it behaves as if it never saw the information yet retains its essential knowledge.
Next time you forget a name, take a moment to appreciate the fallibility of human memory. Selective forgetting, something that humans are all too good at, turns out to be exceptionally difficult to recreate in machine learning models. That’s especially true for a class of AI models known as foundation models that may have picked up personal, copyrighted, or toxic information buried in their training data.
Machine unlearning as a field emerged about a decade ago in response to new online privacy laws giving citizens the “right to be forgotten.” The laws require companies to scrub personal photos, tracking data, and other information from their models if asked. But the recent rise of foundation models trained on raw, internet-scale data has introduced new complexities.
Selectively removing data from a model without retraining it from scratch turns out to be challenging, whether the model was trained on labeled data, as with traditional ML models, or on raw data scraped from the internet, as with LLMs and other generative foundation models.
The size of LLMs, however, makes isolating and removing unwanted facts and concepts especially difficult. Many of these models have been trained on terabytes of data and contain tens of billions of weights. Additional problems include removing the influence of undesired data without degrading the model’s performance on other tasks, and proving that the model unlearned just what you intended it to.
Researchers at IBM, with colleagues in academia, lay out the challenging landscape in a new review paper on arXiv. Unlearning, they argue, could be the key to correcting some of generative AI’s intrinsic flaws, which can include ‘hallucinating’ wrong or misleading content, spewing toxic language, and divulging dangerous information.
While techniques like retrieval-augmented generation (RAG) can help to reduce hallucinations, by limiting the model’s responses to documents in a database, RAG can’t do anything to help remove copyrighted, personal, or other kinds of data that pose liability issues. And while HAP detectors and other tools can help flag problematic data, removing it often involves retraining the model from scratch, which can be prohibitively expensive.
The point of LLM unlearning, ultimately, is to erase the data without retraining the model. Many technical challenges remain, however, which is why researchers are also developing methods that reduce or eliminate the unwanted data’s influence without necessarily providing guarantees the data is gone forever.
There are two main strategies for trying to prevent an LLM from leaking unwanted information. The more effective is to change the model itself by readjusting its weights, an adjustment that can be made before or after alignment and fine-tuning. The other strategy leaves the weights intact and instead targets incoming prompts at inference time.
Unlearning methods that target the LLM’s weights can be thought of as influencing its long-term memory, an intervention that typically leads to more lasting results. Prompt-based techniques, by contrast, are more like trying to tweak its working memory.
Nathalie Baracaldo, an expert in AI security and privacy at IBM Research and a co-author of the review paper, compares prompt-based interventions to telling someone not to answer questions on a sensitive or controversial topic. “We all know that doesn’t work very well,” she said.
“True unlearning tries to remove all vestiges of the unwanted information, so that when the model gets a problematic question, it simply doesn’t have the answer,” she added. “A model that has ‘unlearned’ insulting behavior no longer knows how to be toxic.”
Ideally, unlearning also comes with a mathematical guarantee that the unwanted data’s influence on the model has been erased. Achieving that gold standard, however, typically involves retraining the model, which for LLMs can be prohibitively expensive.
One option for unlearning without guarantees is to fine-tune the model on the unwanted data using an optimization technique known as gradient ascent to forget connections between data points. “Using gradient ascent to update the model’s weights is like running the model’s training in reverse,” said Swanand Ravindra Kadhe, a senior research scientist at IBM Research focused on unlearning. “It cancels out the data’s effect. However, it can also hurt the model’s performance on other tasks.”
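To make that concrete, here is a minimal sketch of gradient-ascent unlearning using the Hugging Face Transformers library. The model name, learning rate, and forget set are placeholders rather than details from IBM’s work; the essential move is negating the language-modeling loss so that each update pushes the model away from the unwanted text.

```python
# Minimal sketch of gradient-ascent unlearning (illustrative, not IBM's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_texts = ["An example sentence the model should forget."]  # hypothetical forget set

model.train()
for text in forget_texts:
    batch = tokenizer(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    # Gradient ascent: negate the language-modeling loss so the optimizer
    # pushes the model *away* from predicting the unwanted text.
    loss = -outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice the ascent steps are interleaved with checks on held-out tasks, since exactly the side effect Kadhe describes, degraded performance elsewhere, shows up quickly if the updates run unchecked.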
Alternatives to gradient ascent have emerged to minimize collateral forgetting. In representation misdirection unlearning, neurons activated by the unwanted data are made to fire randomly, essentially inducing amnesia. At the same time, the model’s preexisting knowledge is reinforced by feeding it a representative sample of its original training data. In task vector negation, the model is tuned on the unwanted data to identify the pattern in its changing weights. To forget the data, the weights are adjusted to reverse the pattern.
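Of the two, task vector negation is the simpler to sketch in code, since it comes down to weight arithmetic: fine-tune a copy of the model on the unwanted data, treat the change in weights as a “task vector,” and subtract a scaled version of that vector from the original weights. The function below is an illustrative sketch with placeholder names, not code from the review paper.

```python
# Illustrative sketch of task vector negation (weight arithmetic only).
# base_model is the original LLM; tuned_model is a copy fine-tuned on the data to forget.
import copy
import torch

def negate_task_vector(base_model, tuned_model, scale=1.0):
    """Return a copy of base_model with the 'forget' task vector subtracted."""
    unlearned = copy.deepcopy(base_model)
    base_params = dict(base_model.named_parameters())
    tuned_params = dict(tuned_model.named_parameters())
    with torch.no_grad():
        for name, param in unlearned.named_parameters():
            # Task vector: how the weights moved when learning the unwanted data.
            task_vector = tuned_params[name] - base_params[name]
            # Negation: move the original weights in the opposite direction.
            param.copy_(base_params[name] - scale * task_vector)
    return unlearned
```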
In a recent paper published at ICML, researchers showed that both methods could be substantially improved by categorizing the data targeted for unlearning, for example, separating hate speech targeting women from hate speech targeting minorities. “Think of it as trying to forget bad memories tied to hiking separately from a childhood trauma,” said Kadhe, who co-authored the study.
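Structurally, category-wise unlearning can be organized as a loop that applies whichever unlearning routine is chosen (for example, the gradient-ascent loop sketched earlier) to one category of unwanted data at a time. The function and names below are hypothetical and only illustrate that structure.

```python
# Hypothetical sketch: unlearning the forget set one category at a time.
from typing import Callable, Dict, List

def unlearn_by_category(model, forget_set: Dict[str, List[str]], unlearn_step: Callable):
    """Apply an unlearning routine to each category of unwanted data separately."""
    for category, texts in forget_set.items():
        # Treating each category as its own forgetting task keeps the weight
        # updates for one kind of content from washing out another's.
        model = unlearn_step(model, texts)
    return model
```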
As the field of LLM unlearning evolves, benchmarks for measuring its effectiveness are starting to proliferate. Researchers can now test a model’s capacity for selective amnesia on corpuses with names like Weapons of Mass Destruction Proxy (WMDP), for unlearning dangerous information like instructions for building a bomb, and Enron Email Dataset, for unlearning information about how to commit fraud.
Microsoft tested one of its unlearning algorithms by measuring how well it could forget copyrighted Harry Potter books. But one complicating factor was the many articles and social media posts that mention Harry Potter but aren’t protected by IP laws. Should that information also be erased?
Measuring an LLM’s ability to forget undesirable data is not an exact science. Researchers are trying to resolve several fundamental challenges in addition to defining what information, exactly, to unlearn.
Many models are trained on billions to trillions of tokenized words. How do you verify that your unlearning algorithm caught every last instance? Different data types may also require different verification techniques. One way of testing whether copyrighted material may have slipped into an LLM’s training data, for example, is to run what’s called a membership inference attack to see if it will divulge details it shouldn’t have. Researchers are developing other red teaming methods to check that unwanted data was actually unlearned.
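One of the simplest membership inference tests is a loss threshold: if the model assigns a suspiciously low language-modeling loss (that is, a high likelihood) to a candidate passage, the passage may have been part of its training data, and after unlearning the same passage should no longer stand out. The snippet below is a minimal sketch of that idea; the model name and threshold are placeholder choices, not values from the research described here.

```python
# Minimal loss-threshold membership inference sketch (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def likely_member(text: str, threshold: float = 3.0) -> bool:
    """Flag text whose per-token loss falls below a chosen threshold."""
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**batch, labels=batch["input_ids"]).loss.item()
    return loss < threshold

# Run the same check before and after unlearning: flagged passages should stop looking like members.
print(likely_member("A candidate copyrighted passage to test."))
```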
Identifying concepts the model may have lost in the process of unlearning individual data points is another focus of research. How do you identify and measure lost knowledge that can hurt the model’s performance on the tasks you care about most? You don’t want your model to recite lines from Harry Potter, but you may want it to know why quidditch is now played on college campuses. You might also want your model to know that a product exists without revealing its secret formula.
IBM is experimenting with ways of applying unlearning to its Granite family of LLMs. Data filters do a great job at removing unwanted data during pre-training and alignment, but problematic content still slips through. “Unlearning could be another way to remove unwanted behavior,” said Kush Varshney, an IBM fellow who leads IBM’s AI-safety efforts and co-authored the review paper.
Today, many LLMs, including IBM’s open-source Granite models, come with fact sheets describing everything they learned during training. In the future, their permanent record might also include a list of everything they have forgotten.