IPV: A system for identifying privacy vulnerabilities in datasets
Abstract
The automated discovery of privacy vulnerabilities in large datasets containing person-specific information is an important first step in the privacy-preserving data publishing process and an area of increased interest for commercial data masking offerings. In this paper, we describe Identification of Privacy Vulnerabilities (IPV), a scalable system for automatically analyzing datasets to expose privacy vulnerabilities. IPV offers data owners state-of-the-art algorithms for 1) computing the direct identifiers and the quasi-identifiers of a dataset, defined respectively as the single attributes and the minimal combinations of attributes whose values are shared by only a few records; 2) calculating the vulnerability index of a dataset, by reporting, for each combination of attributes, the cardinality of the smallest group of records that share the same values; and 3) reporting the specific records in a dataset that contain a combination of unique or rare values. All of these algorithms operate in a parallel, massively multi-threaded fashion and support a range of hardware configurations, from commodity machines to multi-CPU, multi-core nodes in cluster environments. After describing the system, we discuss the algorithms currently supported by IPV and provide examples of how they work. We conclude with a discussion of promising directions for future research that could improve IPV.
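To make the three analyses concrete, the following is a minimal single-threaded sketch in Python. It is not the IPV implementation: the function names, the threshold `k` (a group with fewer than `k` records is treated as "few"), the brute-force enumeration of attribute combinations, and the toy dataset are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def group_sizes(records, attrs):
    """Count how many records share each value combination on attrs."""
    return Counter(tuple(r[a] for a in attrs) for r in records)


def vulnerability_index(records, attributes):
    """Map every attribute combination to the size of its smallest group."""
    return {
        combo: min(group_sizes(records, combo).values())
        for size in range(1, len(attributes) + 1)
        for combo in combinations(attributes, size)
    }


def minimal_identifiers(records, attributes, k=2):
    """Return minimal attribute sets whose smallest group has fewer than k records.

    Sets of size 1 play the role of direct identifiers; larger sets play
    the role of quasi-identifiers (hypothetical threshold-based reading).
    """
    found = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # Skip supersets of an already reported set: they are not minimal.
            if any(set(f) <= set(combo) for f in found):
                continue
            if min(group_sizes(records, combo).values()) < k:
                found.append(combo)
    return found


def vulnerable_records(records, attrs, k=2):
    """Return the records that fall in a group of fewer than k records on attrs."""
    sizes = group_sizes(records, attrs)
    return [r for r in records if sizes[tuple(r[a] for a in attrs)] < k]


# Toy dataset (invented for this sketch).
records = [
    {"id": 1, "zip": "10001", "gender": "F"},
    {"id": 2, "zip": "10001", "gender": "F"},
    {"id": 3, "zip": "10001", "gender": "M"},
    {"id": 4, "zip": "10002", "gender": "M"},
    {"id": 5, "zip": "10002", "gender": "M"},
    {"id": 6, "zip": "10002", "gender": "F"},
]
attributes = ["id", "zip", "gender"]

print(minimal_identifiers(records, attributes))  # [('id',), ('zip', 'gender')]
print(vulnerability_index(records, attributes)[("zip",)])  # 3
print([r["id"] for r in vulnerable_records(records, ("zip", "gender"))])  # [3, 6]
```

In this toy data, `id` alone isolates every record, while `zip` and `gender` are harmless individually but isolate two records in combination. The sketch enumerates all attribute combinations, which grows exponentially with the number of attributes; a scalable system would need pruning and parallel execution, which this example does not attempt.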