Use of Twitter data toward the development of an English and Swahili question answering agent for the Kenyan customer service market
Abstract
The growth in the number of internet and social media users on the African continent has expanded the channels through which organizations interact with their customers. Although many chatbots and other automated question answering systems are in operation, their deployment in countries like Kenya has been limited by implementation challenges arising from the presence of multiple languages (code mixing), underdeveloped training corpora, and non-uniform spelling. In this note we describe the development of several components of a chatbot system intended to handle customer service queries in Kenya. We describe the collection and preparation of a custom training corpus built from Twitter data containing English and Swahili messages. The note covers the preprocessing steps used to standardize the message format, along with the use of word embeddings for an initial categorization of queries that is intended to direct the future workflow of the chatbot system. Additionally, the work uses the corpus to measure how accurately word embeddings with k-nearest neighbours (KNN), and TF-IDF with and without n-grams, predict the correct response to a customer question based on its similarity to previously seen questions. This preliminary work measures the ability of the system to preprocess customer inquiries, to classify them broadly, and to either provide an answer automatically, direct the query to a human, discard the message, or seek further clarification.
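As a rough illustration of the similarity-based retrieval described above, the following is a minimal sketch of TF-IDF over word n-grams with a nearest-neighbour lookup and a fallback to a human agent. It is not the authors' implementation: the example questions, the response labels, the `answer` function, the smoothed IDF formula, and the 0.3 similarity threshold are all assumptions made for illustration, and the sample messages are invented rather than taken from the corpus.

```python
import math
import re
from collections import Counter

def ngrams(text, n_max=2):
    """Lowercase, split into word tokens, and emit all 1..n_max word n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_idf(docs, n_max=2):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc, n_max)))
    n_docs = len(docs)
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

def vectorize(text, idf, n_max=2):
    """L2-normalized sparse TF-IDF vector; n-grams unseen in training are dropped."""
    tf = Counter(t for t in ngrams(text, n_max) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def answer(query, qa_pairs, idf, threshold=0.3):
    """Return (response, score) for the most similar known question,
    or hand off to a human when nothing is similar enough."""
    q = vectorize(query, idf)
    best_score, best_response = 0.0, None
    for question, response in qa_pairs:
        score = cosine(q, vectorize(question, idf))
        if score > best_score:
            best_score, best_response = score, response
    if best_score < threshold:  # threshold is an assumed value
        return "escalate_to_human", best_score
    return best_response, best_score

# Invented English and Swahili examples with hypothetical response labels.
QA_PAIRS = [
    ("Nimetuma pesa kwa namba isiyo sahihi", "reversal_instructions"),
    ("How do I check my data balance", "balance_instructions"),
    ("Ninawezaje kununua bundles za data", "bundle_purchase_instructions"),
]
IDF = fit_idf([q for q, _ in QA_PAIRS])

print(answer("nimetuma pesa kwa namba wrong", QA_PAIRS, IDF))
print(answer("good morning everyone", QA_PAIRS, IDF))
```

Setting `n_max=1` gives the plain TF-IDF variant without n-grams, so the two configurations can be compared on the same question set. A query that shares no n-grams with any known question scores zero and is routed to a human.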