Linking IT Product Records

Katya Mirylenka; Paolo Scotton; Christoph Miksovic Czasch; Salah-eddine Bariol Alaoui

ECML PKDD 2019

Short course

16 Sep 2019

Abstract

Today’s enterprise decision making relies heavily on insights derived from vast amounts of data from different sources. To acquire these insights, the available data must be cleaned, integrated and linked. In this work, we focus on the problem of linking records that contain textual descriptions of IT products.

Following the insights of domain experts about the importance of alphanumeric substrings for IT product descriptions, we propose a trainable similarity measure that assigns higher weight to alphanumeric tokens, is invariant to token order and handles typographical errors. The measure is based on Levenshtein distance with trainable parameters that assign more weight to the most discriminative tokens. Not being frequency-based, the parameters capture the semantic specificities of IT product descriptions. For our task we assess the performance of the most promising lightweight similarity measures, such as (a) edit measure (Levenshtein), (b) frequency-weighted token-based (WHIRL) similarity measure, and (c) the measure based on BERT embeddings after unsupervised retraining. We compare them with the proposed spelling-error-tolerant and order-indifferent hybrid similarity measure that we call the Levenshtein tokenized measure.

Using a real-world dataset, we show experimentally that the Levenshtein tokenized measure achieves the best performance for our task.
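
The abstract does not spell out the formula of the Levenshtein tokenized measure, so the following is only a minimal illustrative sketch of the three properties it names: tolerance to typographical errors (per-token Levenshtein similarity), invariance to token order (best-match alignment between token sets), and higher weight for alphanumeric tokens. All function names, the tokenization rule, the symmetric averaging, and the fixed `alnum_weight` value are assumptions made for illustration; in the paper the weights are trainable parameters rather than a hand-set constant.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def token_similarity(s: str, t: str) -> float:
    """Normalize edit distance into a similarity in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def tokenize(text: str) -> list:
    """Assumed tokenization: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_weight(token: str, alnum_weight: float) -> float:
    """Hypothetical weighting: tokens mixing letters and digits count more."""
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return alnum_weight if (has_digit and has_alpha) else 1.0


def tokenized_similarity(left: str, right: str, alnum_weight: float = 2.0) -> float:
    """Order-invariant, typo-tolerant similarity between two descriptions."""
    left_tokens, right_tokens = tokenize(left), tokenize(right)
    if not left_tokens or not right_tokens:
        return 0.0

    def directed(source, target):
        # Each source token is matched to its most similar target token,
        # so the position of tokens in the string does not matter.
        total = weighted = 0.0
        for token in source:
            w = token_weight(token, alnum_weight)
            best = max(token_similarity(token, other) for other in target)
            total += w
            weighted += w * best
        return weighted / total

    # Average both directions so the measure is symmetric.
    return 0.5 * (directed(left_tokens, right_tokens)
                  + directed(right_tokens, left_tokens))


if __name__ == "__main__":
    print(tokenized_similarity("ThinkPad T480s", "T480s ThinkPad"))
    print(tokenized_similarity("ThinkPad T480s", "ThinkPda T480s"))
    print(tokenized_similarity("ThinkPad T480s", "ThinkPad X1"))
```

In this sketch, reordering the tokens leaves the score unchanged, a misspelled word lowers it only slightly, and a mismatch in the alphanumeric model token lowers it more than it would under uniform weights. Replacing the constant `alnum_weight` with weights learned from labeled record pairs would be the natural next step toward a trainable measure.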
