Semantic Concept Annotation for Tabular Data
Abstract
Determining the semantic concepts of columns in tabular data is useful for many applications, ranging from data integration, cleaning, and search to feature engineering and model building in machine learning. Several prior works have proposed supervised learning-based or heuristic-based approaches to semantic type annotation. These techniques generalize poorly over a large number of concepts or examples. Recent neural network-based supervised learning methods generalize to different datasets but require large amounts of curated training data and present scalability issues. Furthermore, none of the known methods works well for numerical data. We present C2, a system that maps each column to a concept using a maximum likelihood estimation approach through ensembles. It effectively utilizes vast amounts of openly available, albeit somewhat noisy, table corpora, in addition to two popular knowledge graphs (Wikidata and DBpedia), to perform effective and efficient concept annotation for tabular data. Specifically, we utilize a collection of 32 million openly available web tables from several sources. We also present efficient indexing techniques for categorical string, numeric, and mixed-type data, and novel techniques for utilizing table context. We demonstrate the effectiveness and efficiency of C2 relative to existing techniques on 9 real-world datasets containing a wide variety of concepts.