Shiqiang Wang, Nathalie Baracaldo Angel, et al.
NeurIPS 2022
In this paper, we examine how the Spark platform can be leveraged to efficiently process fine-grained provenance queries over large volumes of workflow provenance data. Naive Spark solutions based on recursive querying incur a high data-scanning cost and therefore do not perform well. We propose a novel provenance framework engineered to quickly identify a small volume of data that contains the entire lineage of the queried data item. This small volume is then processed recursively to determine the provenance of the queried data item. We study the effectiveness of the proposed framework on a provenance trace obtained from a financial-domain text curation workflow and report our observations. We show that the proposed framework easily outperforms the naive approaches.
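The two-phase idea in the abstract (first narrow the trace to a small slice that contains the full lineage, then recurse only within that slice) can be illustrated with a minimal sketch. The abstract does not say how the small slice is identified; the sketch below assumes, purely for illustration, that provenance edges are pre-grouped by connected component. It uses plain Python rather than Spark, and all function names and the sample data are hypothetical.

```python
from collections import defaultdict


def build_component_index(edges):
    """Map every node to a component id using union-find.

    `edges` is an iterable of (child, parent) pairs, meaning that
    `child` was derived from `parent`. Grouping by connected component
    is an assumption of this sketch, not the paper's stated method.
    """
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for child, parent in edges:
        a, b = find(child), find(parent)
        if a != b:
            root[a] = b
    return {node: find(node) for node in list(root)}


def partition_edges(edges, component_of):
    """Group edges by the component of their child node (offline step)."""
    parts = defaultdict(list)
    for child, parent in edges:
        parts[component_of[child]].append((child, parent))
    return parts


def query_provenance(item, parts, component_of):
    """Return all ancestors of `item`, scanning only its own partition."""
    comp = component_of.get(item)
    if comp is None:
        return set()  # item does not appear in the trace
    parents = defaultdict(list)
    for child, parent in parts[comp]:
        parents[child].append(parent)
    seen, stack = set(), [item]
    while stack:  # iterative form of the recursive lineage walk
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


# Hypothetical trace: two unrelated derivation chains.
edges = [
    ("report", "summary"),
    ("summary", "doc1"),
    ("summary", "doc2"),
    ("other", "doc9"),
]
component_of = build_component_index(edges)
parts = partition_edges(edges, component_of)
ancestors = query_provenance("report", parts, component_of)
```

In this sketch the query for `report` touches only the three edges in its own partition and never reads the unrelated `other` chain, which is the kind of scan reduction the abstract describes. On Spark, the same grouping key could plausibly serve as a partitioning or filter column, though that mapping is likewise an assumption here.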
Yaodong Huang, Jiarui Zhang, et al.
ICDCS 2019
Yuan Ma, Scott C Smith, et al.
CLOUD 2024
Umesh Deshpande
ICDCS 2019