Bum Chul Kwon, Janu Verma, et al.
IEEE CG&A
To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark show that our approach can increase machine learning accuracy and robustness in the face of evolving data.
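The core idea described in the abstract, inclusion probabilities that decay exponentially with item age plus periodic retraining on the current sample, can be sketched briefly. The sketch below is only a hypothetical illustration of exponential time decay, not the paper's T-TBS or R-TBS algorithms; the class name `DecaySampler` and its parameters (`accept_prob`, `decay`) are assumptions made for the example.

```python
import math
import random


class DecaySampler:
    """Minimal sketch: each retained item survives one time step with
    probability exp(-decay), so an item of age a remains in the sample
    with probability proportional to exp(-decay * a), i.e., inclusion
    probabilities decay exponentially over time."""

    def __init__(self, accept_prob=0.1, decay=0.05, seed=0):
        self.accept_prob = accept_prob  # chance a new arrival enters the sample
        self.decay = decay              # exponential decay rate per time step
        self.sample = []                # current sample of (timestamp, item) pairs
        self.rng = random.Random(seed)

    def process_batch(self, items, t):
        # Age out old items: each survives this step with prob exp(-decay).
        keep = math.exp(-self.decay)
        self.sample = [(ts, x) for (ts, x) in self.sample
                       if self.rng.random() < keep]
        # Admit new arrivals with a fixed acceptance probability.
        for x in items:
            if self.rng.random() < self.accept_prob:
                self.sample.append((t, x))
        return [x for (_, x) in self.sample]


# Example: consume a stream in batches and retrain on the current sample
# every few steps (retrain_model is a placeholder, not a real API).
sampler = DecaySampler()
for t in range(100):
    batch = list(range(1000))                 # stand-in for one batch of the stream
    current_sample = sampler.process_batch(batch, t)
    if t % 10 == 0:
        pass  # retrain_model(current_sample) would go here
```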
Kevin Beyer, Peter J. Haas, et al.
SIGMOD 2007
Rakesh Agrawal, Peter J. Haas, et al.
SIGMOD 2003
Peter W. Glynn, Peter J. Haas
Communications in Statistics - Theory and Methods