An Integrated Approach of Deep Learning and Symbolic Analysis for Digital PDF Table Extraction

Mengshi Zhang; Daniel Perelman; Vu Le; Sumit Gulwani

International Conference on Pattern Recognition | October 2020

Organized by IEEE

Deep learning has shown great success at interpreting unstructured data, for example recognizing objects in images. Symbolic/logical-reasoning techniques have shown great success at interpreting structured data, for example extracting tables from webpages, custom text files, and spreadsheets. The tables in PDF documents are often generated from such structured sources (text-based Word/LaTeX documents, spreadsheets, webpages) but end up being unstructured. We thus explore novel combinations of deep learning and symbolic reasoning techniques to build an effective solution for PDF table extraction. We evaluate effectiveness without granting partial credit for matching part of a table, since partial matches may cause silent errors in downstream data processing. Our method achieves a 0.725 F1 score (vs. 0.339 for the state of the art) on detecting correct table bounds, a much stricter metric than the common one of detecting characters within tables, on a well-known public benchmark (ICDAR 2013), and a 0.404 F1 score (vs. 0.144 for the state of the art) on our private benchmark with more widely varied table structures.
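
To make the "no partial credit" idea concrete, the following is a minimal sketch of a strict F1 score over table bounding boxes. It is an illustration only, not the evaluation protocol used in the paper: the box representation `(x0, y0, x1, y1)`, the function names `boxes_match` and `strict_f1`, and the coordinate tolerance `tol` are all assumptions introduced here. A predicted table counts as correct only if all four of its bounds agree with a ground-truth table, so a box that covers most of a table still scores zero for that table.

```python
from typing import List, Tuple

# Hypothetical box representation: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def boxes_match(a: Box, b: Box, tol: float = 0.0) -> bool:
    """Return True only if every bound of `a` is within `tol` of `b`."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


def strict_f1(predicted: List[Box], truth: List[Box], tol: float = 0.0) -> float:
    """F1 over whole tables: a prediction earns credit only on a full match.

    Each ground-truth table can be matched by at most one prediction.
    """
    unmatched = list(truth)
    true_positives = 0
    for p in predicted:
        for t in unmatched:
            if boxes_match(p, t, tol):
                unmatched.remove(t)
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Assumed example data: the second prediction misses the bottom edge by 1 unit.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(0, 0, 10, 10), (20, 20, 30, 29)]
score = strict_f1(predicted, truth)
```

In this example the second prediction overlaps almost all of its table yet earns no credit, so only one of the two tables counts as detected. A character-level metric would score the same prediction much higher, which is why a strict bounds metric produces lower numbers for the same output.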