
Reimagining storage for the generative AI era

IBM Research and IBM Storage are partnering to make storage systems “content aware.” This new paradigm promises to reshape how businesses leverage their vast data estates.

Every business in the world is built on top of data. Sales projections, logistics manifests, payroll documents, user manuals, sales reports, marketing plans, earnings reports — all contain valuable information that helps a business run and plan for the future. These could be used to help businesses move faster, if only they could be quickly found and put to work. But a vast amount of the world’s enterprise data is hidden away in formats that AI systems can’t use, like raw documents, objects, or files.

Retrieval-augmented generation (RAG) has emerged as a leading approach to putting enterprise documents to work. Using RAG, you take unstructured data, extract the content, break it up into smaller chunks, and run them through an embedding model to generate vector representations, which are stored in a vector database. When a user submits a query at inference time, that query is itself converted to a vector and compared against the vectors in the vector database to find the most relevant content, which an LLM then uses to generate a response to the initial query. Today, all of this is orchestrated outside an enterprise’s storage system.
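To make those moving parts concrete, here is a minimal sketch of that external orchestration in Python. The chunking scheme, the all-MiniLM-L6-v2 embedding model, and the in-memory NumPy index are illustrative choices rather than details of any particular product; a production deployment would use a dedicated vector database.

```python
# Minimal RAG orchestration sketch: extract -> chunk -> embed -> index -> query.
# Illustrative only: a real deployment stores vectors in a vector database,
# not an in-memory array, and uses its own embedding model and chunker.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def chunk(text: str, size: int = 500) -> list[str]:
    """Split extracted document text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Offline: extract content from documents, chunk it, and embed every chunk.
documents = ["...contents of a sales report...", "...contents of a user manual..."]
chunks = [c for doc in documents for c in chunk(doc)]
index = model.encode(chunks, normalize_embeddings=True)       # shape: (n_chunks, dim)

# Inference time: embed the query and retrieve the top-k most similar chunks.
query = model.encode(["What were Q3 sales in EMEA?"], normalize_embeddings=True)[0]
scores = index @ query                                        # cosine similarity
top_k = np.argsort(-scores)[:5]                               # 5 most relevant chunks
context = [chunks[i] for i in top_k]                          # grounding passed to the LLM
```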

An illustration of offline data processing, in which enterprise data is vectorized and loaded into a vector database. At query time, relevant enterprise data is retrieved and passed to an LLM to improve the quality of generative AI responses.

RAG systems are challenging to deploy and manage at scale due to the significant complexity and engineering effort required. The biggest challenges fall into four main categories: data security, data freshness, scaling, and costs. Finding a way to overcome these hurdles inspired a group at IBM Research and IBM Storage to rethink how data is stored for AI uses. The solution is called content-aware storage, or CAS.

A better way

To date, we have always treated storage systems as secure, reliable, and passive managers of bits and bytes. The same data that goes in will reliably come out sometime in the future, and in the meantime, it will be kept safe. We assume that to extract any value from that data, we have to remove it from the storage system and transform it. But this assumption necessarily leads to issues of synchronization. It becomes an incredibly difficult problem to keep two independent systems (a storage system and a vector database) aligned in terms of freshness, security, and efficiency.

This is what led the team to fundamentally rethink the role of the storage system itself: making it an active participant in transforming data for AI. Not only does it make sense for storage systems to play this role, it is arguably the most logical place to put this functionality. Solving the security and freshness challenges requires deep integration between the vector database and the system of record, which is where data access controls are set and maintained, and where all data changes are recorded in the first place.

An illustration of a content-aware storage system, where storage, a data pipeline, and a vector database are deeply integrated into a single system.

First announced last month at Nvidia GTC, content-aware storage is an entirely new way of thinking about how data is stored for the AI era. The value doesn’t end at just accelerating RAG; transforming the data in the storage system also opens it up to comprehensive searches. Imagine being able to search the contents of the data stored in an object store or file system. Not just the file names, but the actual content. That is now a reality.

How content-aware storage works

This new storage paradigm has four main components: the data processing pipeline, the storage system, the vector database, and dedicated compute for running the processing pipelines and vector similarity search.

Data processing pipeline

The accuracy and throughput of data processing pipelines have matured to the point that it is now conceivable to deploy them at scale over a vast data estate. For example, last year IBM open-sourced Docling, which has rapidly become one of the most popular open-source toolkits for converting documents into JSON and Markdown, formats that are easy for LLMs to digest. Docling now has 29.5k stars on GitHub (and growing), and was the #1 trending repository on GitHub worldwide in November 2024. In internal benchmarks to date, we have consistently found that Docling delivers higher-quality document extraction than many alternatives, leading to better RAG accuracy downstream.
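As a rough sketch of what that conversion step looks like with Docling’s Python API (the input file path here is just a hypothetical example):

```python
# Convert a source document into Markdown and JSON with Docling.
# The file path is hypothetical; Docling accepts local paths and URLs.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("reports/q3_earnings.pdf")   # hypothetical input document

markdown_text = result.document.export_to_markdown()    # LLM-friendly Markdown
doc_as_dict = result.document.export_to_dict()          # structured, JSON-serializable dict
```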

To support multimodal use cases, data pipelines can be augmented with IBM’s Granite 2B Vision model, a multimodal model trained on a new instruction-following dataset to improve visual document understanding. Finally, enterprise data can be vectorized with small IBM embedding models, like granite-embedding-30m-english, whose smaller number of vector dimensions makes vector similarity search considerably faster than alternatives commonly found in the industry. Taken together, these state-of-the-art capabilities position IBM’s CAS for accurate extraction and embedding of the data, along with rapid, high-quality outcomes at inference time. We expect to run these models on the Red Hat AI platform.
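The granite-embedding-30m-english model is published on Hugging Face and loads with the standard sentence-transformers library; the snippet below is a minimal sketch of producing its compact 384-dimensional chunk embeddings.

```python
# Embed document chunks with IBM's granite-embedding-30m-english model,
# which produces compact 384-dimensional vectors for fast similarity search.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("ibm-granite/granite-embedding-30m-english")

chunks = [
    "Q3 revenue grew 7% year over year, driven by software.",
    "Power off the device before replacing the filter.",
]
vectors = embedder.encode(chunks, normalize_embeddings=True)
print(vectors.shape)   # (2, 384)
```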

Secure storage and vector integration

A fundamental premise of CAS is that the storage system and the vector database are no longer managed as two independent systems. CAS eliminates the need to maintain separate access control lists (ACLs) for vectors; it is designed to directly use the ACLs defined in the storage system when checking a user’s access to a given set of vectors. Because the storage ACLs are applied directly to the vectors, CAS can keep up with any changes made to file access and apply them to the associated vectors in real time. CAS also takes advantage of existing capabilities of IBM Storage Scale, including Active File Management (AFM). Notifications triggered by data changes initiate the data pipeline, so that the vector database can be updated incrementally instead of repeatedly reprocessing the entire dataset in large batches.
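We haven’t detailed the CAS internals here, but the incremental-update idea can be illustrated with a small event-driven sketch: a change notification triggers re-extraction and re-embedding of just the affected file. In this sketch the open-source watchdog library stands in for Active File Management notifications, and upsert_vectors and delete_vectors are hypothetical helpers, not CAS APIs.

```python
# Illustrative event-driven update loop: re-vectorize only the files that change.
# watchdog stands in for Active File Management notifications; upsert_vectors()
# and delete_vectors() are hypothetical helpers, not CAS APIs.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

def upsert_vectors(path: str) -> None:
    """Hypothetical: extract, chunk, and embed `path`, then update its vectors."""
    print(f"re-vectorizing {path}")

def delete_vectors(path: str) -> None:
    """Hypothetical: remove all vectors derived from a deleted file."""
    print(f"removing vectors for {path}")

class IncrementalIndexer(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            upsert_vectors(event.src_path)

    def on_modified(self, event):
        if not event.is_directory:
            upsert_vectors(event.src_path)

    def on_deleted(self, event):
        if not event.is_directory:
            delete_vectors(event.src_path)

observer = Observer()
observer.schedule(IncrementalIndexer(), path="/data/enterprise-docs", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)      # keep processing change notifications
except KeyboardInterrupt:
    observer.stop()
observer.join()
```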

Vector search

In a typical RAG scenario, a user’s query, once converted to a vector, is compared for similarity with a large collection of vectors in a vector database representing enterprise-specific data. The goal is to rapidly and accurately retrieve the top-k (e.g., top-10 or top-100) most relevant vectors. As the number of vectors in the vector database grows, delivering high accuracy and speed together becomes a challenge.

We are approaching this problem from two angles. First, we are challenging ourselves to see how far we can scale real-time search while maintaining high recall (accuracy). Early prototypes indicate that by clustering the vectors offline, we can build a scalable index of up to 10 billion vectors (384 dimensions) and query it with a mean latency of 60 ms (running on CPU) at 90% recall. That means we have a path to searching, and accurately retrieving relevant chunks from, roughly 1 billion documents in real time. We find these results encouraging, as there is still a lot of room to improve the accuracy of our prototype.
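We haven’t published the prototype’s index design, but clustering-based approximate search of this general kind can be sketched with an IVF (inverted file) index. The example below uses the open-source FAISS library purely as a stand-in, with 384-dimensional vectors and illustrative cluster and probe counts.

```python
# Illustrative IVF index: cluster vectors offline, then scan only the closest
# clusters at query time. FAISS is a stand-in here, not the CAS prototype;
# vector counts and parameters are illustrative.
import numpy as np
import faiss

dim, n_vectors, n_clusters = 384, 100_000, 1024
vectors = np.random.rand(n_vectors, dim).astype("float32")

quantizer = faiss.IndexFlatIP(dim)   # coarse index over cluster centroids
index = faiss.IndexIVFFlat(quantizer, dim, n_clusters, faiss.METRIC_INNER_PRODUCT)
index.train(vectors)                 # offline clustering step
index.add(vectors)

index.nprobe = 32                    # clusters scanned per query: speed/recall trade-off
query = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(query, 10)   # approximate top-10 neighbors
```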

For cases where accuracy is the most important consideration, we are also challenging ourselves to make “exact search” practical: evaluating every vector for similarity against the user’s query through a brute-force search, ensuring we find the most similar vectors. We believe this will be uniquely valuable in enterprise scenarios where finding precisely the right document is worth a few seconds of added latency.
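Conceptually, exact search is just a full scan; a minimal NumPy sketch is below. What the sketch does not capture is the hard part: making that full scan fast and power-efficient at billion-vector scale.

```python
# Exact (brute-force) top-k search: score every stored vector against the query.
# This guarantees 100% recall; the engineering challenge is doing it quickly
# and power-efficiently at billion-vector scale.
import numpy as np

def exact_top_k(index: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k most similar vectors by inner-product score."""
    scores = index @ query
    top = np.argpartition(-scores, k)[:k]     # unordered top-k candidates
    return top[np.argsort(-scores[top])]      # sorted by descending similarity

dim = 384
index = np.random.rand(200_000, dim).astype("float32")
index /= np.linalg.norm(index, axis=1, keepdims=True)   # normalize for cosine similarity
query = index[12_345]                                   # query identical to a stored vector
print(exact_top_k(index, query)[:3])                    # first hit should be 12345
```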

We have built an early prototype showing that indeed we can make this practical leveraging the IBM Spyre AI Accelerator. In our experiments, we have demonstrated a path to searching over 1 billion vectors (384 dimensions) with 100% recall, within a latency window of a few seconds, using a single server populated with 12 cards, all while consuming considerably less power per query than CPU and GPU baselines. This provides IBM a path to delivering the highest recall in the industry that we are aware of, within practical constraints of latency and power consumption.

We envision a dual approach, where we offer scalable, real-time search over a very large volume of enterprise data that returns the desired results the large majority of the time. For cases where recall of 100% is needed, we can deliver that too, in a highly efficient and optimized fashion leveraging unique IBM technology.

Retrofitting existing systems

Perhaps most importantly, users don’t have to move the data they already have in storage systems, like existing file and object deployments, into a new storage system. With IBM Storage and Active File Management, CAS can connect directly to third-party storage systems (file, object, or HDFS) and retrofit them with “content awareness.” This means that reams of enterprise data can be opened up for generative AI applications without having to completely overhaul how that data is stored.

What’s next

CAS is already available today as part of IBM Fusion, IBM Storage’s “in-a-box” solution for generative AI. The team at IBM is also actively working with partners to pilot this emerging capability and continuously improve the design. With the upcoming advances in CAS scalability, security, and accuracy, we expect this new paradigm will reshape the entire storage industry. Just a few short years from now, we likely won’t even remember what the world looked like before content-aware storage, or how time-consuming setting up enterprise generative AI applications used to be.