Using Machine Learning to Accelerate Data Wrangling
Abstract
An estimated 70% of the time spent on data analytics goes not to analysis itself but to data wrangling: the process of finding, interpreting, extracting, preparing, and recombining the data to be analyzed. For data collected as free-form text, missing or competing standards often produce a variety of formats for expressing the same type of data, which makes data wrangling tedious and error-prone. For example, a US street address may be expressed with a house number, a PO Box, a rural or military route, and/or a direction, all of which can be abbreviated or spelled out in a variety of ways. In this paper, we present an algorithm that uses machine learning to efficiently and automatically identify the categories of attributes, such as geospatial attributes, that are present in a data file, and we discuss results on a variety of real data sets. Our implementation can be used to automatically prepare data for consumption by other tools and services, such as mapping and visualization tools; it is motivated by, and built in support of, a customizable severe weather alerting service.
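To make the address-format variability described above concrete, the following is a minimal illustrative sketch. It is not the algorithm presented in this paper: it uses hand-written patterns rather than a learned model, and it only guesses whether a column of free-form text might hold US street addresses. The function names, the pattern list, and the 0.6 threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately incomplete patterns (assumptions for illustration);
# real address data has many more variants than these.
_SUFFIX = r"(?:st|street|ave|avenue|rd|road|blvd|boulevard|dr|drive|ln|lane|hwy|highway)"
_DIRECTION = r"(?:n|s|e|w|ne|nw|se|sw|north|south|east|west)"

ADDRESS_PATTERNS = [
    # House number, optional direction, street name, street suffix.
    re.compile(
        rf"^\d+\s+(?:{_DIRECTION}\.?\s+)?[\w\s]+?\s+{_SUFFIX}\.?\b",
        re.IGNORECASE,
    ),
    # PO Box, abbreviated or spelled out.
    re.compile(r"^(?:p\.?\s*o\.?|post\s+office)\s*box\s+\d+", re.IGNORECASE),
    # Rural route or military-style route (RR, HC, PSC, Unit) followed by a number.
    re.compile(r"^(?:rr|rural\s+route|hc|psc|unit)\s*\d+", re.IGNORECASE),
]


def looks_like_address(value: str) -> bool:
    """Return True if the value matches any of the illustrative patterns."""
    text = value.strip()
    return any(pattern.search(text) for pattern in ADDRESS_PATTERNS)


def column_is_address(values, threshold=0.6):
    """Guess whether a column holds addresses.

    The column is flagged when the fraction of non-empty values that match
    a pattern reaches the (assumed) threshold.
    """
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return False
    hits = sum(looks_like_address(v) for v in non_empty)
    return hits / len(non_empty) >= threshold


if __name__ == "__main__":
    sample = ["123 N. Main St", "PO Box 42", "RR 2 Box 15", "n/a"]
    print(column_is_address(sample))
```

Even this small sketch needs a separate pattern for each address style and still misses many variants, which is the kind of manual effort a learned classifier is intended to reduce.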