Cross Dataset Verification of Cell Type Annotation using a Transcriptomic Biomedical Foundation Model: Inflammatory Bowel Disease Use Case
Abstract
Computational cell type annotation is an essential task for efficiently understanding single-cell sequencing data. Advances in machine learning have made accurate cell type annotation possible through learned cell-type classification models. Transcriptomic foundation models, which incorporate large public sequencing datasets and use a variation of BERT, have been developed to learn gene expression representations. Such models achieve high annotation performance because the learned representation retains essential information from large genetic repositories. Subsequent re-training allows a relatively small dataset with cell labels to be used to effectively predict cell types in unannotated datasets. This study provides a cell annotation pipeline for Inflammatory Bowel Disease (IBD) using a Biomedical Foundation Model (BMFM) based on scBERT (a variant of BERT). A model is first pre-trained on a large transcriptomic dataset and then re-trained on a single IBD dataset to predict its cell types. The re-trained model can subsequently be used to annotate new IBD transcriptomic datasets. Using this pipeline, we examined how well cell types are annotated. Starting from a model pre-trained on Panglao DB (which contains 1 million cells in various conditions), we re-trained a cell-type model on one IBD dataset, SCP259 (360K cells), and used it to annotate another IBD dataset, SCP1884 (700K cells), which yields a cell-type mapping from SCP259 to SCP1884. Examination of the mapping found high concordance: 69% of SCP1884 cell types (47 of 68) were predicted as a corresponding cell type across three lineages (epithelial, immune, and stroma). Our results indicate that the BMFM-based annotation approach effectively helps in understanding the wide variety of IBD cells.
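The cross-dataset mapping and its concordance figure can be illustrated with a minimal sketch. It assumes that each cell in the target dataset has an original label and a label predicted by the re-trained model, that each original cell type is mapped to its most frequently predicted label, and that a curator supplies the set of labels considered appropriate for each type. The function names, the majority-vote rule, the toy labels, and the `expected` table are all hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def majority_mapping(original_labels, predicted_labels):
    """Map each original cell type to its most frequently predicted label."""
    votes = defaultdict(Counter)
    for orig, pred in zip(original_labels, predicted_labels):
        votes[orig][pred] += 1
    return {orig: counts.most_common(1)[0][0] for orig, counts in votes.items()}


def concordance(mapping, expected):
    """Fraction of original cell types whose mapped label is in the curated set."""
    hits = sum(
        1 for orig, pred in mapping.items() if pred in expected.get(orig, set())
    )
    return hits / len(mapping)


# Toy per-cell labels (hypothetical; not data from the study).
original = [
    "Enterocyte A", "Enterocyte A", "Enterocyte A",
    "T cell X", "T cell X",
    "Fibro Y", "Fibro Y",
]
predicted = [
    "Enterocytes", "Enterocytes", "Goblet",
    "CD4+ T", "CD4+ T",
    "Plasma", "Plasma",
]

# Hypothetical curated table: which predicted labels count as appropriate.
expected = {
    "Enterocyte A": {"Enterocytes"},
    "T cell X": {"CD4+ T", "CD8+ T"},
    "Fibro Y": {"Fibroblasts"},
}

mapping = majority_mapping(original, predicted)
score = concordance(mapping, expected)
print(mapping)
print(f"{score:.0%} of cell types mapped appropriately")
```

Under this scheme, a result such as 47 appropriate mappings out of 68 cell types corresponds to a concordance of 47/68, or about 69%.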