Data wrangling: The challenging journey from the wild to the lake
Abstract
Much has been written about the explosion of data, also known as the “data deluge”. Similarly, much of today's research and decision making is based on the de facto acceptance that knowledge and insight can be gained from analyzing and contextualizing the vast (and growing) amount of “open” or “raw” data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via “siloed” data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities. The often unstated premise of a data lake is that it relieves users from dealing with data acquisition and maintenance issues, and guarantees fast access to local, accurate, and updated data without incurring the development costs (in terms of time and money) typically associated with structured data warehouses. However appealing this premise may be, it is our practical experience, and that of our customers, that “raw” data is logistically difficult to obtain, quite challenging to interpret and describe, and tedious to maintain. Furthermore, these challenges multiply as the number of sources grows, thus increasing the need to thoroughly describe and curate the data in order to make it consumable. In this paper, we present and describe some of the challenges inherent in creating, filling, maintaining, and governing a data lake, a set of processes that collectively define the actions of data wrangling, and we propose that what is really needed is a curated data lake, in which the lake contents have undergone a curation process that enables their use and delivers the promise of ad hoc data accessibility to users beyond the enterprise IT staff.