Linking IT Product Records
Abstract
Today’s enterprise decision-making relies heavily on insights derived from vast amounts of data from different sources. To acquire these insights, the available data must be cleaned, integrated and linked. In this work, we focus on the problem of linking records that contain textual descriptions of IT products. Following the insights of domain experts about the importance of alphanumeric substrings in IT product descriptions, we propose a trainable similarity measure that assigns higher weight to alphanumeric tokens, is invariant to token order and handles typographical errors. The measure is based on Levenshtein distance, with trainable parameters that assign more weight to the most discriminative tokens. Because the parameters are not frequency-based, they capture the semantic specifics of IT product descriptions. For our task, we assess the performance of the most promising lightweight similarity measures: (a) an edit-based measure (Levenshtein), (b) a frequency-weighted token-based similarity measure (WHIRL), and (c) a measure based on BERT embeddings after unsupervised retraining. We compare them with the proposed hybrid similarity measure, which tolerates spelling errors and ignores token order, and which we call the Levenshtein tokenized measure. Using a real-world dataset, we show experimentally that the Levenshtein tokenized measure achieves the best performance for our task.
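To make the idea of a token-level, order-invariant Levenshtein measure concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the measure defined in this work: the trainable parameters are replaced by a single fixed, hypothetical weight (`alnum_weight`) for tokens that mix letters and digits, and the tokenizer, the best-match aggregation and all function names are our own choices for the example.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def token_similarity(s: str, t: str) -> float:
    """Edit distance normalized to a similarity in [0, 1]."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))


def token_weight(token: str, alnum_weight: float = 2.0) -> float:
    """Assumption: tokens mixing letters and digits get a fixed higher weight.

    In the paper these weights are trainable; the constant here is hypothetical.
    """
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return alnum_weight if (has_digit and has_alpha) else 1.0


def tokenize(text: str) -> list:
    """Lowercase and split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenized_similarity(x: str, y: str, alnum_weight: float = 2.0) -> float:
    """Weighted average of best per-token matches, symmetrized over both inputs."""
    tx, ty = tokenize(x), tokenize(y)
    if not tx or not ty:
        return 1.0 if tx == ty else 0.0

    def directed(src, dst):
        # Each source token is matched to its most similar target token,
        # regardless of position, which makes the score order-invariant.
        total = sum(token_weight(t, alnum_weight) for t in src)
        score = sum(
            token_weight(t, alnum_weight) * max(token_similarity(t, u) for u in dst)
            for t in src
        )
        return score / total

    return 0.5 * (directed(tx, ty) + directed(ty, tx))
```

In this sketch, reordering the tokens of a description leaves the score unchanged, a typo in a plain word such as "servar" for "server" lowers the score only slightly, and a one-character difference in a model number such as "x3650" versus "x3550" is penalized more heavily because of the higher token weight.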