I like big tables

I like tables. I use them in my notes, reports, and software documentation, especially for cost, performance, and estimation information. And it's not just me; everyone likes tables.

Buzz Lightyear saying "tables everywhere"

The business and key players

Docsumo capabilities
Nanonets capabilities

Past approaches

  1. Extracting tables from native text: the open-source Python library Tabula lets you do exactly that. If you can select the text in a PDF with your cursor, Tabula can extract it. This approach uses the PDF's internal structure to extract text entities and build a table from them.
  2. Classical ML approach: train a supervised ML model on hand-tuned features (such as vertical and horizontal spacing between text, visible lines, etc.). The model can then output table region/rectangle coordinates, cell boxes, and so on.
  3. CNN-based approaches: treat the PDF page as an image and run a CNN to detect table regions, cells, and the corresponding text. You can then use this information to reconstruct the table.
  4. Transformer-based approach: in 2022, Microsoft released the Table Transformer model along with PubTables-1M, the largest labelled dataset of PDF tables. The model achieved good accuracy and can handle many different kinds of document tables.
  5. Graph neural networks: identify cells and the text inside them, and define edges based on their spatial relationships. A table in a PDF can thus be modelled as a graph, which can later be used to extract the table itself.
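To make the classical ML idea (approach 2) concrete, here is a minimal sketch of one hand-tuned feature: the horizontal gap between consecutive words on a line. Wide, recurring gaps between word columns are a classic signal that a region is tabular rather than prose. The function names and the gap threshold are illustrative assumptions, not part of any particular product's pipeline.

```python
# Word boxes are (x0, y0, x1, y1) rectangles in page coordinates.

def horizontal_gaps(boxes):
    """Return the horizontal gaps between consecutive word boxes on one line."""
    boxes = sorted(boxes, key=lambda b: b[0])
    return [nxt[0] - cur[2] for cur, nxt in zip(boxes, boxes[1:])]

def looks_tabular(lines, gap_threshold=20):
    """Flag a block of lines as table-like if most lines contain a wide gap.

    `lines` is a list of lines, each a list of word boxes. The threshold
    is a hand-tuned constant chosen for this toy example only; a real
    system would feed such features into a trained classifier.
    """
    wide = sum(1 for line in lines
               if any(g >= gap_threshold for g in horizontal_gaps(line)))
    return wide / max(len(lines), 1) > 0.5

# Two lines whose words form two columns separated by a wide gap:
table_like = [
    [(0, 0, 50, 10), (100, 0, 150, 10)],
    [(0, 20, 55, 30), (100, 20, 160, 30)],
]
# Two lines of ordinary prose with tight word spacing:
prose_like = [
    [(0, 0, 50, 10), (55, 0, 110, 10)],
    [(0, 20, 60, 30), (65, 20, 130, 30)],
]
print(looks_tabular(table_like))   # True
print(looks_tabular(prose_like))   # False
```

A production system would combine many such features (vertical spacing, ruling lines, alignment) rather than relying on a single threshold.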
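The graph modelling step behind approach 5 can also be sketched in a few lines: cells become nodes, and edges connect cells whose bounding boxes share a row band or a column band. This only builds the adjacency structure; a real GNN would then run message passing over it. All names here are hypothetical, chosen for illustration.

```python
def overlaps(a0, a1, b0, b1):
    """True if the 1-D intervals [a0, a1] and [b0, b1] overlap."""
    return a0 < b1 and b0 < a1

def build_table_graph(cells):
    """Return edges (i, j) between cells that share a row or a column.

    `cells` maps a node id to its (x0, y0, x1, y1) bounding box.
    """
    edges = set()
    ids = list(cells)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ax0, ay0, ax1, ay1 = cells[a]
            bx0, by0, bx1, by1 = cells[b]
            same_row = overlaps(ay0, ay1, by0, by1)  # vertical overlap
            same_col = overlaps(ax0, ax1, bx0, bx1)  # horizontal overlap
            if same_row or same_col:
                edges.add((a, b))
    return edges

# A 2x2 grid of cells laid out as:  A B
#                                   C D
cells = {
    "A": (0, 0, 10, 10), "B": (20, 0, 30, 10),
    "C": (0, 20, 10, 30), "D": (20, 20, 30, 30),
}
print(sorted(build_table_graph(cells)))
# Row edges A-B and C-D, column edges A-C and B-D; no diagonal A-D edge.
```

Node features (cell text, box geometry) plus these edges are exactly what a graph neural network consumes to classify cells and recover the table structure.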

Most of the companies mentioned above use a mixture of the neural-network-based approaches (3, 4, 5) to process documents, and let customers train their own custom ML models.
An example document AI stack

deepdoctection architecture

Future

Although transformer- and GNN-based models achieve much better accuracy, there are still a few challenges