Workshop paper

SemCLIP: A Semantic Memory-Aligned Vision Language Model

Abstract

Vision-language models (VLMs) bring image and textual representations close together in a joint embedding space, which is useful for tagging and retrieval from content stores. However, such associations are not very stable: a synonymous textual query does not retrieve the same set of images, nor even a set with a high degree of overlap. This is due to the absence of linkages between semantically related concepts in vision-language models. In contrast, the episodic memory store in the brain has linkages to the semantic conceptual memory subsystem, which aid in both the formation and recall of memories. In this paper, we exploit this paradigm to link a VLM to a semantic memory, thereby producing a new semantic vision-language model called SemCLIP. Specifically, we develop a semantic memory model for the language of object-naming nouns that reflects their semantic similarity. We then link a vision-language model to the semantic memory model through a semantic alignment transform. This leads to a richer and more stable understanding of concepts by bringing synonymous visual concepts and their associated images closer together. Both the semantic memory model and the alignment transform can be learned from word knowledge sources, thus avoiding large-scale retraining of VLMs on real-world image-text pairs. The resulting model is shown to outperform existing embedding models on semantic similarity and downstream retrieval tasks across multiple datasets.
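
To make the idea of a semantic alignment transform concrete, the following is a minimal illustrative sketch, not the paper's actual formulation: it assumes the transform is a linear map from the VLM embedding space into a separate semantic memory space and is trained from word-level supervision alone. All names (align, vlm_noun_emb, sem_mem_emb), the dimensions, and the cosine objective are assumptions introduced here for illustration.

import torch
import torch.nn.functional as F

d_vlm, d_sem, n_nouns = 512, 300, 1000

# Placeholder embeddings standing in for (a) VLM text embeddings of
# object-naming nouns and (b) a semantic memory built from word
# knowledge sources; in practice these would be precomputed.
vlm_noun_emb = F.normalize(torch.randn(n_nouns, d_vlm), dim=-1)
sem_mem_emb = F.normalize(torch.randn(n_nouns, d_sem), dim=-1)

# Hypothetical linear alignment transform from VLM space to semantic space.
align = torch.nn.Linear(d_vlm, d_sem, bias=False)
opt = torch.optim.Adam(align.parameters(), lr=1e-3)

for step in range(200):
    proj = F.normalize(align(vlm_noun_emb), dim=-1)
    # Pull each projected noun embedding toward its semantic-memory
    # counterpart (cosine alignment); nouns that are close in the
    # semantic memory end up close after projection as well.
    loss = 1.0 - (proj * sem_mem_emb).sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# At retrieval time, both image and text embeddings from the VLM could be
# passed through the same learned transform before nearest-neighbor search.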