Contextualizing Single-Cell Analyses: An AI Pipeline for Evidence Search from Literature and Gene Databases
Abstract
The increasing availability of single-cell RNA-seq platforms has produced a rapidly growing body of research articles and single-cell datasets. Recent developments in Large Language Models (LLMs), combined with AI techniques such as Knowledge Graphs (KGs), have created new opportunities to search, extract and integrate multi-source information to expand our understanding of single-cell expression analyses. Here we present initial work on a novel pipeline that extracts and integrates information to provide context for single-cell analyses. The pipeline comprises several modules that process structured and unstructured data. The open-source IBM DeepSearch toolkit is used to mine literature collections (e.g., PubMed Central), and a biomedical annotator (UMLS-MetaMap) is used to extract specific information about genes, cell types and function, which in turn helps identify relevant passages in documents. A chain of LLM calls is then used to validate and extract more fine-grained information, such as gene clusters, from the passages. Next, we investigate how ontologies and KGs can be used to organize these findings and combine them with structured information from publicly available cell marker databases, such as PanglaoDB. Finally, LLMs are used to produce summary reports in natural language. Our experiments focus on understanding human intestinal cells: specific searches based on known gene markers (e.g., for Goblet cells) were carried out in the literature and in PanglaoDB, and the results were compared against those of our own independent lab experiments. We discuss the advantages of the technologies used, and how the proposed pipeline can generate evidence-based contextual information to support both basic research and emerging biomedical foundation models.
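As a rough illustration of the evidence-combination idea described above, the following minimal Python sketch selects passages that mention a cell type together with genes of interest, then merges those literature hits with entries from a marker database. It is not the pipeline's implementation: plain keyword matching stands in for the DeepSearch, UMLS-MetaMap and LLM stages, and the function names, the `Evidence` class, the passages and the marker dictionary are all hypothetical toy values rather than data from PanglaoDB or from the experiments.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One gene/cell-type association and the sources that support it."""
    gene: str
    cell_type: str
    sources: set = field(default_factory=set)


def find_relevant_passages(passages, genes, cell_type):
    """Keep passages that mention the cell type and at least one gene.

    Keyword matching is an assumption made for this sketch; it replaces
    the biomedical annotation and LLM validation steps.
    """
    hits = []
    for text in passages:
        if cell_type.lower() not in text.lower():
            continue
        tokens = {tok.strip(".,;()").upper() for tok in text.split()}
        mentioned = sorted(g for g in genes if g.upper() in tokens)
        if mentioned:
            hits.append((text, mentioned))
    return hits


def merge_evidence(passage_hits, marker_db, cell_type):
    """Combine literature hits with marker-database entries per gene."""
    merged = {}
    for _, genes in passage_hits:
        for gene in genes:
            merged.setdefault(gene, Evidence(gene, cell_type)).sources.add("literature")
    for gene in marker_db.get(cell_type, []):
        merged.setdefault(gene, Evidence(gene, cell_type)).sources.add("marker_db")
    return merged


# Toy inputs, invented for illustration only.
passages = [
    "MUC2 and TFF3 are highly expressed in goblet cells of the intestine.",
    "LGR5 marks intestinal stem cells.",
    "Goblet cells secrete mucus.",
]
genes_of_interest = ["MUC2", "TFF3", "LGR5"]
toy_marker_db = {"goblet cells": ["MUC2", "CLCA1"]}

hits = find_relevant_passages(passages, genes_of_interest, "goblet cells")
merged = merge_evidence(hits, toy_marker_db, "goblet cells")

for gene, ev in sorted(merged.items()):
    print(gene, sorted(ev.sources))
```

Recording the set of supporting sources per gene makes it easy to separate associations backed by both literature and a database from those found in only one of them, which is the kind of contrast a summary report could then describe.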