A faster, systematic way to train large language models for enterprise

IBM’s new synthetic data generation method and phased-training protocol allow enterprises to update their LLMs with task-specific knowledge and skills, taking some of the guesswork out of training generative AI models.

Modern chatbots have become astoundingly good at generating conversations in the voice of a pirate or summarizing reports in the style of an accountant. But that’s not always the case: they can still go off-topic or provide incorrect information.

Much of their uneven performance comes down to training data. For most bots, that’s raw text scraped from the internet, followed by task-specific information generated by either humans or machines that’s added during fine-tuning (or alignment).

The large language models (LLMs) behind modern chatbots are pre-trained on the raw text to learn an abstract representation of language. This then primes them to learn many tasks quickly once they see labeled, detailed instructions during alignment.

But quality instruction data is hard to come by. It’s laborious and expensive for humans to make, and typically lacks the depth and breadth that chatbots need to guide them through difficult, rare, or ambiguous situations. Synthetic data costs a lot less, but it often suffers from a similar lack of variety.

IBM has a new solution: Large-scale Alignment for chatBots, or LAB. It’s a method for systematically generating synthetic data for the tasks you want your chatbot to accomplish, and for assimilating new knowledge and capabilities into the foundation model without overwriting what the model has already learned. With LAB, LLMs can be improved substantially in far less time and at lower cost than training typically requires.

“Instruction data is the lever for building a chatbot that behaves the way you want it to,” said Akash Srivastava, chief architect of LLM alignment at IBM Research. “Our method allows you to write a recipe for the problems you want your chatbot to solve and to generate instruction data to build that chatbot.”

A recipe for generating high-quality, large-scale instruction data

IBM’s data-generation method is driven by a taxonomy that allows LLM developers to define the knowledge and skills they want to add to their chatbot. The taxonomy maps out the LLM’s existing knowledge and skills in a logical, hierarchical way, giving developers a framework to identify and fill in gaps with new knowledge and skills.

The taxonomy guides a second LLM, known as the teacher model, in generating high-quality instructions, formulated as pairs of questions and answers tailored to the task at hand. Let’s say you want a chatbot to be able to draft an email for a CEO summarizing their company’s third-quarter earnings. The task requires an understanding of financial statements, basic math and reasoning, and the ability to summarize financial information in an email that strikes the right tone.

IBM’s taxonomy works by segregating instruction data into three overarching categories: knowledge, foundational skills, and compositional skills that draw on knowledge and foundational skills.

Here, the data needed might include accounting knowledge, math skills, and a combination of writing and reasoning abilities for drafting a coherent email. The teacher model would generate instructions for each category while iteratively running quality control on its results.
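As a rough illustration, the three-category taxonomy for the earnings-email example could be held in a nested structure whose leaves collect seed examples. This is a minimal sketch, not IBM's actual taxonomy format: the node names and the `leaf_paths` helper are hypothetical.

```python
# Hypothetical taxonomy for the earnings-email example. The three top-level
# categories come from the article; every node name below them is illustrative.
TAXONOMY = {
    "knowledge": {
        "finance": {"financial_statements": []},
    },
    "foundational_skills": {
        "math": {"arithmetic": []},
        "reasoning": {"numerical_reasoning": []},
    },
    "compositional_skills": {
        "writing": {"earnings_summary_email": []},
    },
}


def leaf_paths(tree, prefix=()):
    """Yield the path to every leaf (a node holding a list of seed examples)."""
    for name, child in tree.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            yield from leaf_paths(child, path)
        else:
            yield path


paths = list(leaf_paths(TAXONOMY))
for path in paths:
    print(" > ".join(path))
```

Listing the leaves this way makes gaps visible: a skill the chatbot needs but that has no leaf is a place to add a new branch and seed examples.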

In the first step of this hypothetical example, the LLM developer might upload the company’s financial statements and several examples of how to calculate corporate earnings. The teacher model would then generate instructions grounded in the financial documents. This way, if accounting rules change, new instructions can be generated from the updated documents.

On a second path, the teacher model generates instructions that will enable the base LLM to calculate the earnings. On a third path, the developer uploads sample earnings-report emails, and the teacher model generates more instructions that will enable the base model to write the desired email.

The teacher model also runs quality control checks on the data it generated. Acting as its own harshest critic, it discards irrelevant questions and instructions containing incorrect information.
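The generate-then-filter loop can be sketched as follows. Both `stub_generate` and `stub_judge` are hypothetical stand-ins for calls to a teacher model; the article does not describe the actual prompts or scoring criteria, so the checks here are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    question: str
    answer: str
    leaf: str


def stub_generate(leaf, n):
    # Stand-in for a teacher-model call. Every third candidate is left
    # without an answer so the filter below has something to reject.
    out = []
    for i in range(n):
        answer = "" if i % 3 == 2 else f"Worked answer {i} for {leaf}"
        out.append(Instruction(f"Question {i} about {leaf}?", answer, leaf))
    return out


def stub_judge(inst):
    # Stand-in for the teacher critiquing its own output. A real judge would
    # be another prompt to the teacher model; here we only check that the
    # answer is non-empty and the question mentions its taxonomy leaf.
    return bool(inst.answer.strip()) and inst.leaf in inst.question


def generate_vetted(leaf, n):
    """Generate n candidates for one leaf and keep those the judge accepts."""
    return [inst for inst in stub_generate(leaf, n) if stub_judge(inst)]


vetted = generate_vetted("arithmetic", 6)
print(len(vetted), "of 6 candidates kept")
```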

The vetted instructions are then segregated into three buckets — knowledge, foundational skills, and compositional skills — so they can be fed to the LLM in two stages. This graduated training regimen allows the LLM to build on its prior knowledge and skills the same way that we humans progressively expand on what we have learned before.

Figure: LAB organizes task-specific knowledge and skills into a taxonomy, allowing developers to identify and fill in gaps with synthetic instruction data.

Toward more comprehensive learning

IBM’s unique training program is designed to help the LLM assimilate new knowledge and skills quickly and efficiently during alignment. Typically, new knowledge is added during pre-training, the most time-consuming and computationally intensive part of AI development.

In the first phase, the model is fed simple instructions, followed by longer, narrative-like instructions corresponding to the knowledge and foundational skills needed for the target task.

In the second phase, the model is trained on the kinds of task-specific skills needed to write a corporate earnings email, things like summarizing information and putting key details in context. “It turns out that the order matters,” said Srivastava. “We confirmed empirically that the model struggles to assimilate new knowledge if you try to teach it complex skills first.”
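The ordering described above can be sketched as a small curriculum builder. This is an assumption-laden illustration: `build_curriculum` is a hypothetical helper, and using text length as a proxy for "simple" versus "longer, narrative-like" instructions is a simplification the article does not specify.

```python
PHASE_ONE_CATEGORIES = ("knowledge", "foundational_skills")


def build_curriculum(instructions):
    """Split (category, text) pairs into the two training phases."""
    phase_one = [x for x in instructions if x[0] in PHASE_ONE_CATEGORIES]
    phase_two = [x for x in instructions if x[0] == "compositional_skills"]
    # Assumption: shorter text stands in for "simpler", so short instructions
    # are presented before longer ones within the first phase.
    phase_one.sort(key=lambda x: len(x[1]))
    return phase_one, phase_two


sample = [
    ("compositional_skills", "Draft an email to the CEO summarizing third-quarter earnings."),
    ("knowledge", "Define net income and explain how it differs from revenue in a financial statement."),
    ("foundational_skills", "What is 12% of 250?"),
]

phase_one, phase_two = build_curriculum(sample)
print([category for category, _ in phase_one])
print([category for category, _ in phase_two])
```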

The team also found they got better results when they trained the model at a low learning rate, with an extended warm-up, and incorporated the data in large batches. They also used replay buffers, where a small subset of data from early training is reinjected at the end of the process, to prevent the model from overwriting what it learned before.
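Two of these ingredients, the warm-up schedule and the replay buffer, are easy to sketch in plain Python. The peak learning rate, warm-up length, and replay fraction below are illustrative assumptions, not values reported by IBM, and the linear warm-up shape is one common choice rather than a confirmed detail of LAB.

```python
import random


def warmup_lr(step, peak_lr=2e-5, warmup_steps=1000):
    """Linear warm-up to a small peak learning rate, then constant.

    peak_lr and warmup_steps are assumed values for illustration.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def with_replay(new_data, earlier_data, replay_fraction=0.1, seed=0):
    """Mix a small random sample of earlier training data into new data."""
    rng = random.Random(seed)
    k = min(len(earlier_data), int(len(new_data) * replay_fraction))
    mixed = list(new_data) + rng.sample(earlier_data, k)
    rng.shuffle(mixed)
    return mixed


phase_one_data = [f"phase1-{i}" for i in range(100)]
phase_two_data = [f"phase2-{i}" for i in range(100)]
mixed = with_replay(phase_two_data, phase_one_data)
replayed = [x for x in mixed if x.startswith("phase1-")]
print(len(mixed), "examples,", len(replayed), "replayed from phase one")
```

Reinjecting earlier examples gives the model periodic reminders of what it learned before, which is the stated purpose of the replay buffer.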

LAB results

IBM Research generated a synthetic dataset of 1.2 million instructions with the LAB method and trained two open-source LLMs on the data: Labradorite 13B (built on Meta’s Llama-2-13B model) and Merlinite 7B (built on the Mistral 7B model). They found that their aligned models were competitive with state-of-the-art chatbots on a range of benchmarks, including ones for coherent and engaging conversation and common sense reasoning.

IBM’s Labradorite and Merlinite models not only outperformed chatbots aligned on human-generated data, but also models aligned on significantly more synthetic data, including Microsoft’s Orca-2 chatbot, which was trained on 15 million instructions generated by the behemoth GPT-4 model.

IBM also used LAB to significantly improve its own enterprise-focused Granite models on IBM watsonx.

LAB has two distinguishing traits that help explain these results. The teacher model generates synthetic examples from each leaf node of the taxonomy, producing much broader coverage of target tasks. Other methods use random sampling, which limits the breadth of the data generated.
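A toy comparison shows why per-leaf generation guarantees coverage while random allocation does not. This sketch is not a model of any specific competing method: the leaf names, budget, and both helper functions are hypothetical.

```python
import random

leaves = [f"leaf_{i}" for i in range(20)]


def per_leaf_plan(leaves, per_leaf):
    """Give every leaf the same generation budget, so none is skipped."""
    return {leaf: per_leaf for leaf in leaves}


def random_plan(leaves, total, seed=0):
    """Spend the same total budget by picking leaves at random."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(total):
        leaf = rng.choice(leaves)
        counts[leaf] = counts.get(leaf, 0) + 1
    return counts


systematic = per_leaf_plan(leaves, per_leaf=2)
sampled = random_plan(leaves, total=40)
print("per-leaf plan covers", len(systematic), "of", len(leaves), "leaves")
print("random plan covers", len(sampled), "of", len(leaves), "leaves")
```

With the same budget of 40 examples, the per-leaf plan always reaches all 20 leaves, whereas the random plan can leave some leaves with no examples at all.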

LAB also allows you to add new knowledge and skills to the base LLM without having to incorporate this information into the teacher model as well. “This means you don’t need some all-powerful teacher model that distills its capabilities into the base model,” said David Cox, vice president for AI models at IBM Research.

It also allows LLM developers to generate their own instructions without having to worry about the legality of using proprietary LLMs like GPT-4 to generate synthetic data.

IBM’s LAB method grew out of the team’s insight that great alignment data can bring advanced capabilities to smaller, more cost-effective models that can be tailored for enterprise needs. Pre-training is important, but giving the model highly curated task-specific instructions is just as important.

“The brilliant part of it is that it’s far easier to improve your chatbot during alignment than it is during initial training,” said Cox. “This method levels the playing field, allowing smaller open-source models to compete with models pre-trained on thousands of GPUs and aligned with human-generated instructions.”