Related Table Search for Numeric data using Large Language Models and Enterprise Knowledge Graphs
Abstract
Searching for related tables is a crucial part of enterprise data lake exploration. However, data lakes often contain numeric tables with unreliable column headers, as well as ID columns whose text names have been lost. Finding related numeric tables in large data lakes is therefore a challenging task. State-of-the-art related table search relies on text values in tables and cannot be applied to numeric tables. Conversely, state-of-the-art semantic labeling of numeric tables using enterprise knowledge graphs (EKGs) relies on heuristic and rule-based approaches for determining numeric types and EKG labels; these approaches introduce semantic ambiguity and lead to poor performance. In this paper, we propose a system, NumSearchLLM, that leverages large language models (LLMs) alongside EKGs to alleviate the ambiguity in semantic labeling of numeric columns and to facilitate both joinable table search and more general table relatedness tasks. Specifically, we use LLMs to: (i) discover new relationships absent from EKGs; (ii) validate numeric types assigned by heuristics; and (iii) check whether the semantic labels assigned to the columns of a table form a meaningful schema. We also show how EKGs can be used in conjunction with LLMs to fix labeling inconsistencies discovered by the LLMs by finding alternate labels. We show that the integrated use of LLMs and EKGs achieves superior performance on joinable and related table search tasks compared with current approaches.
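To make uses (ii) and (iii) and the EKG-based repair step concrete, the following is a minimal illustrative sketch, not the NumSearchLLM implementation. Every name in it is an assumption introduced for illustration: the `LLM` callable stands in for any prompt-to-text model interface, `ColumnLabel` is a hypothetical record whose `candidates` list holds ranked EKG labels for one column, and the prompts and the one-column-at-a-time repair strategy are illustrative choices only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical LLM interface: takes a prompt string, returns the model's reply text.
LLM = Callable[[str], str]


@dataclass
class ColumnLabel:
    column: str            # column identifier within the table
    numeric_type: str      # numeric type proposed by a heuristic (assumed input)
    candidates: List[str]  # ranked EKG labels; candidates[0] is the current choice


def says_yes(llm: LLM, prompt: str) -> bool:
    """Interpret the reply as a yes/no answer (illustrative parsing rule)."""
    return llm(prompt).strip().lower().startswith("yes")


def validate_type(llm: LLM, label: ColumnLabel, sample_values: Sequence[float]) -> bool:
    """Use (ii): ask the LLM whether the heuristic numeric type fits the values."""
    prompt = (
        f"Do the values {list(sample_values)} plausibly represent "
        f"a '{label.numeric_type}'? Answer yes or no."
    )
    return says_yes(llm, prompt)


def schema_is_meaningful(llm: LLM, labels: Sequence[ColumnLabel]) -> bool:
    """Use (iii): ask the LLM whether the chosen labels form a sensible schema."""
    names = ", ".join(label.candidates[0] for label in labels)
    prompt = f"Do these column labels form a meaningful table schema: {names}? Answer yes or no."
    return says_yes(llm, prompt)


def repair_schema(llm: LLM, labels: List[ColumnLabel]) -> List[ColumnLabel]:
    """If the schema is rejected, try alternate EKG labels one column at a time."""
    if schema_is_meaningful(llm, labels):
        return labels
    for i, label in enumerate(labels):
        for alt in label.candidates[1:]:
            swapped = ColumnLabel(label.column, label.numeric_type, [alt])
            trial = labels[:i] + [swapped] + labels[i + 1:]
            if schema_is_meaningful(llm, trial):
                return trial
    return labels  # no consistent alternative found; keep the original labels
```

In this sketch the LLM acts only as a validator, while the alternate labels come from the EKG candidate list, mirroring the division of labor described above. A real system would need more robust answer parsing and a search over label combinations rather than single-column swaps.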