Ursprung: Provenance for large-scale analytics environments
Abstract
Modern analytics has produced wonders, but reproducing and verifying these wonders is difficult. Data provenance helps to solve this problem by collecting information on how data is created and accessed. Although provenance collection techniques have been used successfully on a smaller scale, tracking provenance in large-scale analytics environments is challenging due to the scale of provenance generated and the heterogeneous domains. Without provenance, analysts struggle to keep track of and reproduce their analyses. We demonstrate Ursprung1, a provenance collection system specifically targeted at such environments. Ursprung transparently collects the minimal set of system-level provenance required to track the relationships between data and processes. To collect domain specific provenance, Ursprung enables users to specify capture rules to curate application-specific logs, intermediate results etc. To reduce storage overhead and accelerate queries, it uses event hierarchies to synthesize raw provenance into compact summaries.