Discovering Enterprise Concepts Using Spreadsheet Tables

SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) |

Existing work on knowledge discovery mostly uses natural language techniques to extract entities and relationships from textual documents. However, today relational tables are abundant in quantities, and often contain clean and well-structured data values. So far these rich relational tables have been largely overlooked for the purpose of knowledge discovery. In this work, we study the problem of building concept hierarchies using a large corpus of spreadsheet tables. Our method first groups distinct values from tables into a large hierarchical tree based on co-occurrence statistics. We then “summarize” the large tree by selecting important nodes from the tree that can best “describe” the original table corpus. The end result is a small concept hierarchy that is easy for humans to understand and curate. Our end-to-end algorithms are designed to run on Map-Reduce to scale to large table corpus. Experiments on real enterprise spreadsheet corpus show that proposed approach can generate concepts with high quality.