Phrasescope: An effective and unsupervised framework for mining high quality phrases
Abstract
Phrase mining is one of the fundamental NLP tasks that can have significant impact on the efficacy of many downstream applications. Many supervised and unsupervised phrase mining approaches have been proposed. Some rely on linguistic analyzers, and others are language agnostic. A daunting challenge in this task is to distinguish quality phrases from noise phrases, which tightly coexists with quality phrases in the entire frequency spectrum. Most existing approaches to phrase mining, however, rely on frequency-based statistics, hence suffer from quality loss. In this paper, we propose an unsupervised phrase mining framework, “PhraseScope”, which consists of a sequence of filters, namely cohesion, domain, and graph filters, to remove noise phrase. Each filter is responsible for removing noise phrase of particular characteristics. Collectively, our proposed filters are capable of detecting and removing noise phrases effectively while preserving quality phrases. Our results show significant improvement in both recall and precision over state-of-the-art frameworks when tested on three different domains of datasets.