As the volume and complexity of scientific literature continue to grow, the need for domain-specialised language representations becomes increasingly critical. This work presents a new extension of the INDUS v2 sentence embedding models, purpose-built for scientific domains, especially space science. Our approach enhances sentence-level understanding through domain adaptation and extended context processing, addressing key challenges in scientific information retrieval and discovery. The model is trained on an enhanced dataset that integrates peer-reviewed scientific literature with synthetically generated sentence-pair datasets from the Science Discovery Engine (SDE). A key innovation is the extension of the input context length from 512 to 1024 tokens, enabling more robust encoding of longer and denser scientific text passages without compromising inference speed. We also optimised the architecture for low-latency deployment, ensuring suitability for real-time or near-real-time applications such as science-focused search engines and result re-ranking systems. Initial evaluations demonstrate that our domain-specific sentence embedding model outperforms leading baselines, including OpenAI's text-embedding-3-large and nomic-ai/modernbert-embed-base, on domain-relevant benchmarks such as question answering and search-term-to-document matching for information retrieval (IR) tasks. Notably, the model exhibits superior capability in capturing nuanced scientific relationships across disciplines, making it particularly well-suited for scientific information retrieval systems, retrieval-augmented generation (RAG) pipelines, and knowledge discovery agents. By bridging the gap between general-purpose language models and the precision required in scientific domains, this work contributes to the development of more context-aware, domain-adaptable embedding models for the scientific research community.
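The abstract points to retrieval and re-ranking as the target workload. The sketch below is a minimal illustration of that usage pattern with the sentence-transformers library, not the authors' released code: the model identifier "nasa-impact/indus-v2-sentence-embed" is a placeholder (the abstract names no checkpoint), the example query and passages are invented, and the 1024-token setting simply mirrors the extended context window the abstract describes.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder identifier: the abstract does not name a released checkpoint.
model = SentenceTransformer("nasa-impact/indus-v2-sentence-embed")
# Mirror the extended context window described in the abstract.
model.max_seq_length = 1024

query = "Which instruments measure sea surface salinity from orbit?"
docs = [
    "The SMAP radiometer retrieves soil moisture and sea surface salinity.",
    "Wide Field Camera 3 images in ultraviolet and near-infrared light.",
]

# Encode the query and candidate passages into dense sentence embeddings.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Rank passages by cosine similarity, as in search-term-to-document matching.
scores = util.cos_sim(query_emb, doc_embs)[0]
for score, doc in sorted(zip(scores, docs), key=lambda pair: -pair[0]):
    print(f"{float(score):.3f}  {doc}")
```

The same encode-then-score loop generalises to re-ranking search results or fetching context passages for a RAG pipeline; only the candidate set changes.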