Multimodal dialog for browsing large visual catalogs using exploration-exploitation paradigm in a joint embedding space
Abstract
We present a multimodal dialog (MMD) system to assist online customers in visually browsing through large catalogs. Visual browsing allows customers to explore products beyond exact search results. We focus on a slightly asymmetric version of a complete MMD system, in that our agent can understand both text and image queries, but responds only in images. We formulate our problem of "showing the k best images to a user", based on the dialog context so far, as sampling from a Gaussian Mixture Model (GMM) in a high dimensional joint multimodal embedding space. The joint embedding space is learned by Common Representation Learning and embeds both the text and the image queries. Our system remembers the context of the dialog, and uses an exploration-exploitation paradigm to assist in visual browsing. We train and evaluate the system on an MMD dataset that we synthesize from large catalog data. Our experiments and preliminary human evaluation show that the system is capable of learning and displaying relevant products with an average cosine similarity of 0.85 to the ground truth results, and is capable of engaging human users.