Fast Record Linkage for Company Entities
Abstract
Record linkage is an essential part of nearly all real-world systems that consume structured and unstructured data from different sources. Typically, no common key is available for connecting records, so massive data integration processes often have to be completed before any analytics or further processing can be performed. In this work, we focus on company entity matching, where company name, location, and industry are taken into account. Our contribution is a highly scalable, enterprise-grade, end-to-end system that combines rule-based linkage algorithms with a machine learning approach to account for short company names. Linkage time is greatly reduced by an efficient decomposition of the search space using MinHash. On real-world ground-truth datasets, we show that our approach reaches a recall of 91%, compared to 73% for baseline approaches, while scaling linearly with the number of nodes used in the system.
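To illustrate how MinHash can decompose the search space, the following is a minimal sketch and not the system's actual implementation. It assumes character trigrams over normalized company names and a standard banding scheme; the function names (shingles, minhash_signature, candidate_blocks) and the parameter values (32 hash functions, 16 bands) are hypothetical choices made for this example only.

```python
import hashlib
from collections import defaultdict


def shingles(name, k=3):
    """Return the set of character k-grams of a normalized company name."""
    s = " ".join(name.lower().split())  # lowercase and collapse whitespace
    if len(s) <= k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(token, seed):
    # Seeded 64-bit hash built on BLAKE2b; each seed acts as a separate hash function.
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(name, num_hashes=32):
    """MinHash signature: the minimum hash value of the shingle set per hash function."""
    grams = shingles(name)
    return tuple(min(_hash(g, seed) for g in grams) for seed in range(num_hashes))


def candidate_blocks(names, num_hashes=32, bands=16):
    """Group record indices whose signatures agree on at least one band.

    Only records sharing a block need to be compared by the (more expensive)
    linkage step, which avoids comparing all pairs.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        sig = minhash_signature(name, num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets[key].add(idx)
    return [ids for ids in buckets.values() if len(ids) > 1]


if __name__ == "__main__":
    records = ["Acme Corp", "ACME   corp", "Globex"]  # made-up example names
    for block in candidate_blocks(records):
        print(sorted(block))
```

In this sketch, more bands with fewer rows per band make the blocking more permissive (higher recall, more candidate pairs), while fewer bands with more rows make it stricter. Location and industry, which the system also takes into account, are left out here for brevity.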