Stream as you go: The case for incremental data access and processing in the cloud
Abstract
Cloud infrastructures promise to provide highperformance and cost-effective solutions to large-scale data processing problems. In this paper, we identify a common class of data-intensive applications for which data transfer latency for uploading data into the cloud in advance of its processing may hinder the linear scalability advantage of the cloud. For such applications, we propose a "stream-as-you-go" approach for incrementally accessing and processing data based on a stream data management architecture. We describe our approach in the context of a DNA sequence analysis use case and compare it against the state of the art in MapReduce-based DNA sequence analysis and incremental MapReduce frameworks. We provide experimental results over an implementation of our approach based on the IBM InfoSphere Streams computing platform deployed on Amazon EC2, showing an order of magnitude improvement in total processing time over the state of the art. © 2012 IEEE.