Density-based interpretable hypercube region partitioning for mixed numeric and categorical data
Abstract
Consider a dataset with features such as {SEX, INCOME, RACE, EDUCATION}. A user may want to know where in the feature space observations are concentrated, and where they are sparse or absent. We define an interpretable region as a “hypercube”, such as {RACE in {Black, White}} & {10 <= EDUCATION <= 13}, which contains all observations satisfying its constraints; typically, such a region is defined by a small number of features, say 3 or fewer. To quantify the density around each multivariate observation, we compute pairwise Gower distances, which handle both numeric and categorical features, and supply them as input to OPTICS. We then use regression trees to recursively partition the dataset into regions with different average levels of density, which can be ranked. These regions can be useful on their own for manual data exploration, or as input to another application; for instance, an ML model may perform worse in sparse data regions, so the partition may help predict the model’s performance for particular feature values. We believe the combination of density-based partitioning, support for mixed-type data, and interpretable regions is novel and of theoretical interest. Results are shown on toy data that can be visualized.
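As a rough illustration of the density step only, the following minimal sketch computes an unweighted Gower distance on mixed-type records and then each record's distance to its min_pts-th nearest neighbor, which is how OPTICS defines a core distance. It is not the authors' implementation: the toy records, the column layout, the function names, and the choice min_pts=2 are all assumptions made for the example, and the regression-tree partitioning step is not shown.

```python
# Hypothetical toy records: (SEX, INCOME, RACE, EDUCATION); all values are made up.
records = [
    ("F", 30000.0, "Black", 12.0),
    ("M", 32000.0, "White", 12.0),
    ("F", 31000.0, "White", 13.0),
    ("M", 90000.0, "Asian", 20.0),
]
NUMERIC = {1, 3}  # column indices treated as numeric; the rest are categorical


def column_ranges(rows, numeric):
    """Range (max - min) of each numeric column, used to scale differences."""
    return {j: max(r[j] for r in rows) - min(r[j] for r in rows) for j in numeric}


def gower_distance(a, b, ranges, numeric):
    """Unweighted Gower distance: mean of per-feature dissimilarities in [0, 1]."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in numeric:
            # Range-scaled absolute difference; a constant column contributes 0.
            total += abs(x - y) / ranges[j] if ranges[j] > 0 else 0.0
        else:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)


def core_distances(rows, numeric, min_pts):
    """Distance from each row to its min_pts-th nearest neighbor (the row itself counts)."""
    ranges = column_ranges(rows, numeric)
    out = []
    for a in rows:
        dists = sorted(gower_distance(a, b, ranges, numeric) for b in rows)
        out.append(dists[min_pts - 1])
    return out


# A larger core distance means the record sits in a sparser part of the feature space.
scores = core_distances(records, NUMERIC, min_pts=2)
```

In this sketch the fourth record, which differs from the others on almost every feature, receives the largest core distance. A score of this kind is the sort of per-observation density measure that a regression tree could then take as its target, with the original features as predictors.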