Optimal distance bounds on time-series data
Abstract
Most data mining operations include an integral search component at their core. For example, the performance of similarity search or classification based on Nearest Neighbors is largely dependent on the underlying compression and distance estimation techniques. As data repositories grow larger, there is an explicit need not only for storing the data in a compressed form, but also for facilitating mining operations directly on the compressed data. Naturally, the quality or tightness of the estimated distances on the compressed objects directly affects the search performance. We motivate our work within the setting of search engine weblog repositories, where keyword demand trends over time are represented and stored as compressed time-series data. Search and analysis over such sequence data has important applications for the search engines, including discovery of important news events, keyword recommendation and efficient keyword-to-advertisement mapping. We present new mechanisms for very fast search operations over the compressed time-series data, with specific focus on weblog data. An important contribution of this work is the derivation of optimally tight bounds on the Euclidean distance estimation between compressed sequences. Since our methodology is applicable to sequential data in general, the proposed technique is of independent interest. Additionally, our distance estimation strategy is not tied to a specific compression methodology, but can be applied on top of any orthonormal based compression technique (Fourier, Wavelet, PCA, etc). The experimental results indicate that the new optimal bounds lead to a significant improvement in the pruning power of search compared to previous state-of-the-art, in many cases eliminating more than 80% of the candidate search sequences.