Transform-Data-by-Example: An Extensible Search Engine for Data Transformations

  • ,
  • Xu Chu
  • Kris Ganjam
  • Yudian Zheng
  • Vivek Narasayya
  • Surajit Chaudhuri

International Conference on Very Large Databases (VLDB)

Today, business analysts and data scientists increasingly need to clean, standardize, and transform diverse kinds of data, such as names, addresses, date-times, and phone numbers, before they can perform analysis. This process of data transformation is an important part of data preparation, and is known to be difficult and time-consuming for end-users.

Traditionally, developers have dealt with these longstanding transformation problems using custom code libraries. They have built a wide variety of custom logic for tasks such as name parsing and address standardization, and shared the source code in places like GitHub. Data transformation would be much easier for end-users if they could discover and reuse this existing transformation logic.

We developed Transform-Data-by-Example (TDE), which works like a search engine for data transformations. TDE “indexes” a wide variety of transformation logic found in source code, DLLs, web services, and mapping tables. Users only need to provide a few input/output examples to demonstrate the desired transformation, and TDE interactively finds relevant functions and synthesizes new programs consistent with all of the examples. Using an index of 50K functions crawled from GitHub and Stack Overflow, TDE can already handle many common transformations not currently supported by existing systems. On a benchmark of over 200 transformation tasks, TDE generates correct transformations for 72% of the tasks, which is considerably better than the other systems evaluated. A beta version of TDE for Microsoft Excel is available via the Office Store. Part of the TDE technology also ships in Microsoft Power BI.
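
To make the search-by-example idea concrete, the following is a minimal sketch in Python. It is not TDE's implementation: the tiny `INDEX` dictionary is a hypothetical stand-in for the crawled function index, the names `run`, `consistent`, and `search` are invented for illustration, and the brute-force enumeration of short pipelines is an assumed simplification of whatever ranking and synthesis TDE actually performs. It only shows what "find functions and compose a program consistent with all examples" can look like.

```python
from itertools import permutations
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, str]  # (input, expected output)

# Hypothetical stand-in for an index of crawled transformation functions.
INDEX: Dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "upper": str.upper,
    "lower": str.lower,
    "title": str.title,
    "digits_only": lambda s: "".join(ch for ch in s if ch.isdigit()),
    "last_first": lambda s: ", ".join(reversed(s.split(" ", 1))),
}


def run(pipeline: Tuple[str, ...], value: str) -> str:
    """Apply the named functions to the value, left to right."""
    for name in pipeline:
        value = INDEX[name](value)
    return value


def consistent(pipeline: Tuple[str, ...], examples: List[Example]) -> bool:
    """True if the pipeline reproduces every expected output."""
    try:
        return all(run(pipeline, src) == dst for src, dst in examples)
    except Exception:
        # A candidate that crashes on an example is simply rejected.
        return False


def search(examples: List[Example], max_len: int = 2) -> Optional[Tuple[str, ...]]:
    """Return the first (shortest) pipeline consistent with all examples."""
    for length in range(1, max_len + 1):
        for pipeline in permutations(INDEX, length):
            if consistent(pipeline, examples):
                return pipeline
    return None


phones = [("(425) 555-0100", "4255550100"), ("425.555.0199", "4255550199")]
names = [("john smith", "SMITH, JOHN"), ("ada lovelace", "LOVELACE, ADA")]

print(search(phones))
print(search(names))
```

In this toy setting, the phone examples are satisfied by a single indexed function, while the name examples require composing two of them. Exhaustive enumeration like this would not scale to an index of tens of thousands of functions, which is why a real system needs to narrow the candidates before attempting synthesis.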

The benchmark datasets used in this paper are available on GitHub at https://github.com/Yeye-He/Transform-Data-by-Example to facilitate future research.