Conference paper

Text-Guided Few-Shot Semantic Segmentation with Training-Free Multimodal Feature Matching

Abstract

This paper addresses text-guided few-shot semantic segmentation (FSS), in which unseen novel classes are segmented in a target image using image and text references as in-context examples, without any training. To improve the quality and stability of the masks produced by FSS, we combine it with the open-vocabulary capability of zero-shot semantic segmentation (ZSS) built on image and text foundation models. We propose a training-free approach based on multimodal feature matching: segmentation is performed by identifying regions of the target image whose features match those of both the image and text references. Experimental results demonstrate that the proposed method outperforms state-of-the-art FSS and ZSS methods.
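The core idea of training-free multimodal feature matching can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it assumes patch-level target features, reference-image features, and a text feature have already been extracted by frozen foundation-model encoders (here stood in for by synthetic vectors), and that the blending weight `alpha` and threshold `thresh` are illustrative parameters.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity: a is (N, D), b is (M, D) -> (N, M).
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def match_segment(target_patches, ref_image_feats, text_feat, alpha=0.5, thresh=0.5):
    """Training-free multimodal matching (illustrative sketch).

    Each target patch is scored by (i) its best similarity to any
    reference-image feature and (ii) its similarity to the text feature;
    the two scores are blended and thresholded into a binary mask.
    """
    img_score = cosine_sim(target_patches, ref_image_feats).max(axis=1)  # (N,)
    txt_score = cosine_sim(target_patches, text_feat[None, :])[:, 0]     # (N,)
    score = alpha * img_score + (1.0 - alpha) * txt_score
    return (score >= thresh).astype(np.uint8), score

# Toy demo: random vectors stand in for foundation-model embeddings.
rng = np.random.default_rng(0)
fg = rng.normal(size=64)
fg /= np.linalg.norm(fg)                                  # "class" direction
target_patches = np.stack([
    fg + 0.05 * rng.normal(size=64),                      # foreground-like patch
    rng.normal(size=64),                                  # unrelated background patch
])
ref_image_feats = np.stack([fg + 0.05 * rng.normal(size=64)])
mask, score = match_segment(target_patches, ref_image_feats, fg)
```

In this toy run the foreground-like patch matches both the image reference and the text feature, while the random background patch matches neither, so only the first patch survives the threshold.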

Related Work