Metrics and algorithms for routing questions to user communities
Abstract
An online community consists of a group of users who share a common interest, background, or experience, and their collective goal is to contribute toward the welfare of the community members. Several websites allow their users to create and manage niche communities, such as Yahoo! Groups, Facebook Groups, Google+ Circles, and WebMD Forums. These community services also exist within enterprises, such as IBM Connections. Question answering within these communities enables their members to exchange knowledge and information with other community members. However, the onus of finding the right community for question asking lies with an individual user. The overwhelming number of communities necessitates the need for a good question routing strategy so that new questions get routed to an appropriately focused community and thus get resolved in a reasonable time frame. In this article, we consider the novel problem of routing a question to the right community and propose a framework for selecting and ranking the relevant communities for a question. We propose several novel features for modeling the three main entities of the system: questions, users, and communities. We propose features such as language attributes, inclination to respond, user familiarity, and difficulty of a question; based on these features, we propose similarity metrics between the routed question and the system entities. We introduce a Cutoff-Aggregation (CA) algorithm that aggregates the entity similarity within a community to compute that community's relevance. We introduce two k-nearest-neighbor (knn) algorithms that are a natural instantiation of the CA algorithm, which are computationally efficient and evaluate several ranking algorithms over the aggregate similarity scores computed by the two knn algorithms. We propose clustering techniques to speed up our recommendation framework and show how pipelining can improve the model performance. We demonstrate the effectiveness of our framework on two large real-world datasets.