On Efficiently Processing Workflow Provenance Queries in Spark
Abstract
In this paper, we look at how we can leverage Spark platform for efficiently processing fine-grained provenance queries on large volumes of workflow provenance data. Simple recursive querying based Spark solutions involve large data scanning cost and hence do not work well. We propose a novel provenance framework which is engineered to quickly determine a small volume of data containing the entire lineage of the queried data-item. This small volume of data is then recursively processed to figure out the provenance of the queried data-item. We study the effectiveness of the proposed framework on a provenance trace obtained from a financial domain text curation workflow and report our observations. We show that the proposed framework easily outperforms the naive approaches.