Heterogeneous Data Integration by Learning to Rerank Schema Matches
Abstract
Schema matching lies at the heart of integrating heterogeneous structured and semi-structured data, with applications in data warehousing, process matching, data analysis recommendations, and Web table matching, among others. Schema matching is known to be an uncertain process, and a common way of coping with this uncertainty is to present a human expert with a ranked list of possible schema matches from which the expert may choose, an approach known as top-K matching. In this work we propose a learning algorithm that uses an innovative set of features to rerank a list of schema matches and improve the rank of the best match. The proposed algorithm assists the matching process by presenting a high-quality set of alternative matches to a human expert. It also serves as a step towards eliminating the involvement of human experts as decision makers in the matching process altogether. A large-scale empirical evaluation on a real-world benchmark shows the effectiveness of the proposed algorithmic solution.