Preference-aware integration of temporal data
Abstract
A complete description of an entity is rarely contained in a single data source, but rather, it is often distributed across different data sources. Applications based on personal electronic health records, sentiment analysis, and financial records all illustrate that significant value can be derived from integrated, consistent, and queryable profiles of entities from different sources. Even more so, such integrated profiles are considerably enhanced if temporal information from different sources is carefully accounted for. We develop a simple and yet versatile operator, called PRAWN, that is typically called as a final step of an entity integration workflow. PRAWN is capable of consistently integrating and resolving temporal conflicts in data that may contain multiple dimensions of time based on a set of preference rules specified by a user (hence the name PRAWN for preference-aware union). In the event that not all conflicts can be resolved through preferences, one can enumerate each possible consistent interpretation of the result returned by PRAWN at a given time point through a polynomialdelay algorithm. In addition to providing algorithms for implementing PRAWN, we study and establish several desirable properties of PRAWN. First, PRAWN produces the same temporally integrated outcome, modulo representation of time, regardless of the order in which data sources are integrated. Second, PRAWN can be customized to integrate temporal data for different applications by specifying application-specific preference rules. Third, we show experimentally that our implementation of PRAWN is feasible on both "small" and "big" data platforms in that it is efficient in both storage and execution time. Finally, we demonstrate a fundamental advantage of PRAWN: we illustrate that standard query languages can be immediately used to pose useful temporal queries over the integrated and resolved entity repository. © 2014 VLDB Endowment 21508097/14/12.