On sample selection bias and its efficient correction via model averaging and unlabeled examples
Abstract
Sample selection bias is a common problem encountered when using data mining algorithms for many real-world applications. Traditionally, it is assumed that training and test data are sampled from the same probability distribution, the so called "stationary or non-biased distribution assumption." However, this assumption is often violated in reality. Typical examples include marketing solicitation, fraud detection, drug testing, loan approval, school enrollment, etc. For these applications the only labeled data available for training is a biased representation, in various ways, of the future data on which the inductive model will predict. Intuitively, some examples sampled frequently into the training data may actually be infrequent in the testing data, and vice versa. When this happens, an inductive model constructed from biased training set may not be as accurate on unbiased testing data if there had not been any selection bias in the training data. In this paper, we first improve and clarify a previously proposed categorization of sample selection bias. In particular, we show that unless under very restricted conditions, sample selection bias is a common problem for many real-world situations. We then analyze various effects of sample selection bias on inductive modeling, in particular, how the "true" conditional probability P(y|x) to be modeled by inductive learners can be misrepresented in the biased training data, that subsequently misleads a learning algorithm. To solve inaccuracy problems due to sample selection bias, we explore how to use model averaging of (1) conditional probabilities P(y|x), (2) feature probabilities P(x), and (3) joint probabilities, P(x, y), to reduce the influence of sample selection bias on model accuracy. In particular, we explore on how to use unlabeled data in a semi-supervised learning framework to improve the accuracy of descriptive models constructed from biased training samples.