Sampling-based distributed kernel mean matching using Spark
Abstract
Limited access to supervised information can give rise to scenarios in real-world data mining applications where training and test data are related by a covariate shift, i.e., they share the same class-conditional distribution but have different covariate distributions. Traditional data mining techniques assume that training and test data follow the same distribution and therefore suffer in the presence of a covariate shift. Kernel Mean Matching (KMM) is a well-known approach that addresses covariate shift by appropriately re-weighting training instances. However, its time complexity is cubic in the size of the training data, which makes it computationally impractical for large or streaming datasets. In this paper, we present a sampling-based algorithm that addresses the limited scalability of KMM. We further show that the approach is highly parallelizable and propose a distributed algorithm, implemented in Spark, for estimating training instance weights efficiently. Experimental results on benchmark datasets show that the proposed approach achieves competitive estimation accuracy with much lower execution time than the KMM algorithm. Moreover, the results indicate that increasing the training data size yields higher accuracy with minimal effect on the execution time of the proposed approach.
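For context, KMM estimates training-instance weights by solving a quadratic program over the full training kernel matrix, which is the source of the cubic cost mentioned above. A minimal sketch of the standard formulation from the covariate-shift literature is given below; the notation (weights beta, kernel matrix K, vector kappa, bound B, tolerance epsilon) is assumed here for illustration rather than taken from this paper:

% Standard KMM quadratic program (sketch; notation assumed).
% K is the n_tr x n_tr kernel matrix over training points;
% kappa matches each training point against the n_te test points.
\begin{aligned}
\min_{\beta}\quad & \tfrac{1}{2}\,\beta^{\top} K \beta \;-\; \kappa^{\top}\beta \\
\text{s.t.}\quad & 0 \le \beta_i \le B,\qquad
  \Bigl|\tfrac{1}{n_{tr}}\textstyle\sum_{i=1}^{n_{tr}}\beta_i - 1\Bigr| \le \epsilon,\\
\text{where}\quad & K_{ij} = k\bigl(x_i^{tr}, x_j^{tr}\bigr),\qquad
  \kappa_i = \tfrac{n_{tr}}{n_{te}}\textstyle\sum_{j=1}^{n_{te}} k\bigl(x_i^{tr}, x_j^{te}\bigr).
\end{aligned}

Solving this quadratic program over the dense n_tr x n_tr matrix K is what scales cubically with the number of training instances, motivating the sampling-based and distributed strategies proposed in this paper.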