Takuma Udagawa, Aashka Trivedi, et al.
EMNLP 2023
We present the architecture and data model for Textract, a robust, scalable, and configurable document analysis framework. Textract is engineered as a pipeline architecture, which allows rapid prototyping and application development by freely mixing existing, reusable language analysis plugins with new, custom plugins with customizable functionality. We discuss design issues that arise from requirements of industrial-strength efficiency and scalability, and that are further constrained by plugin interactions, both among the plugins themselves and with a common data model comprising an annotation store, a document vocabulary, and a lexical cache. We exemplify some of these issues by focusing on a meta-plugin: an interpreter for annotation-based finite-state transduction, through which many linguistic filters can be implemented as stand-alone plugins. The framework and component plugins have been extensively deployed in both research and industrial environments for a broad range of text analysis and mining tasks. © 2004 Cambridge University Press.
Bemali Wickramanayake, Zhipeng He, et al.
Knowledge-Based Systems
Amarachi Blessing Mbakwe, Joy Wu, et al.
NeurIPS 2023
Imran Nasim, Melanie Weber
SCML 2024