Text-instance graph: Exploring the relational semantics for text-based visual question answering
Abstract
It is time to stop neglecting the text in the world around us. In Visual Question Answering (VQA), surrounding text helps humans understand complete visual scenes and reason about question semantics efficiently. Here, we address the challenging Text-based Visual Question Answering (TextVQA) problem, which requires a model to answer VQA questions by reading text in the image. Existing TextVQA methods mainly focus on the latent relationships among detected object instances, scene texts, and the given question, but ignore the spatial location relationships and complex relational semantics between visual object instances and OCR texts (e.g., "the A of B on C"). To address these challenges, we propose a novel Text-Instance Graph (TIG) network for TextVQA. TIG builds an OCR-OBJ graph to model overlapping relationships, where each graph node is updated using related objects or OCR texts. To handle questions with complex logic, we propose a dynamic OCR-OBJ graph network that extends the perception space of graph nodes, capturing information from non-directly adjacent nodes. For a scene described by "the brand of the computer on the table", the model builds a correlation between "brand" and "table" using the "computer" node as an intermediate node. Extensive experiments on three benchmarks demonstrate the effectiveness and superiority of the proposed method. In addition, TIG achieves 0.505 ANLS on the ST-VQA challenge leaderboard, setting a new state of the art.
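The mechanism the abstract describes, updating each node from its overlap neighbors and widening the receptive field with a second hop so that non-adjacent nodes (e.g., "brand" and "table") can exchange information through an intermediate node, can be illustrated with a minimal message-passing sketch. This is not the paper's implementation; the adjacency construction, weight shapes, and names (`dynamic_graph_update`, `normalize_adj`) are assumptions made for illustration only.

```python
import numpy as np

def normalize_adj(adj):
    """Row-normalize an adjacency matrix after adding self-loops."""
    adj = adj + np.eye(adj.shape[0])
    deg = adj.sum(axis=1, keepdims=True)
    return adj / np.maximum(deg, 1e-8)

def dynamic_graph_update(nodes, adj, W1, W2, hops=2):
    """Propagate features over a joint OCR-object graph.

    nodes: (N, d) stacked OCR-token and object features
    adj:   (N, N) binary adjacency, e.g. from spatial overlap
    Each hop mixes a node with its neighbors; after two hops a node
    also absorbs information from nodes it is not directly adjacent
    to (e.g., "brand" reaching "table" through "computer").
    """
    A = normalize_adj(adj)
    h = nodes
    for W in (W1, W2)[:hops]:
        h = np.maximum(A @ h @ W, 0.0)  # ReLU(A h W): one message-passing hop
    return h

# Toy usage: 5 nodes (OCR tokens + objects) with 8-dim features.
rng = np.random.default_rng(0)
nodes = rng.normal(size=(5, 8))
adj = (rng.random((5, 5)) > 0.6).astype(float)  # stand-in overlap graph
adj = np.maximum(adj, adj.T)                    # make adjacency symmetric
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
out = dynamic_graph_update(nodes, adj, W1, W2)  # (5, 8) updated node features
```

In the sketch, setting `hops=2` plays the role of the dynamic graph extension: one hop restricts each node to its overlap neighbors, while the second hop lets information travel through an intermediate node.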