Using Machine Learning to Accelerate Data Wrangling

Shilpi Ahuja; Mary Roth; Rashmi Gangadharaiah; Peter Schwarz; Rafael Bastidas

doi:10.1109/ICDMW.2016.0055

ICDMW 2016

Conference paper

02 Jul 2016

Abstract

70% of the time spent on data analytics is not actually spent on data analytics, but rather on data wrangling: the process of finding, interpreting, extracting, preparing and recombining the data to be analyzed. For data that is collected as free-form text, the lack of standards, or the presence of competing standards, often results in a variety of formats for expressing the same type of data, making the data wrangling step tedious and error-prone. For example, US street addresses may be expressed with a house number, PO Box, rural or military route, and/or a direction, all of which can be abbreviated or spelled out in a variety of ways. In this paper, we present an algorithm that uses machine learning to efficiently and automatically identify categories of attributes, such as geo-spatial, that are present in a data file, and we discuss results on a variety of real data sets. Our implementation can be used to automatically prepare data for consumption by other tools and services, such as mapping and visualization tools, and is motivated by and in support of a customizable severe weather alerting service.
