PDF table extractor

The PDF table extractor was created as a new tool to address the need to extract tables from PDF documents.

Description

The library extracts table structures from a range of pages within a PDF.

It returns a list of elements, each of which is either a line of text or a table.

The tables are structured in two dimensions, consisting of individual cells that can be accessed to retrieve their contents.
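As a rough illustration of the result shape described above, the following Python sketch models a list whose elements are either text lines or two-dimensional tables. The names TextLine, Table, cell and describe are hypothetical and are not taken from the library; they only show how such a mixed list might be consumed.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextLine:
    """A line of text that is not part of any table (hypothetical name)."""
    text: str


@dataclass
class Table:
    """A two-dimensional grid of cell contents (hypothetical name)."""
    rows: List[List[str]] = field(default_factory=list)

    def cell(self, row: int, col: int) -> str:
        """Return the text of the cell at the given row and column."""
        return self.rows[row][col]


Element = Union[TextLine, Table]


def describe(elements: List[Element]) -> List[str]:
    """Summarise each element so the two kinds are easy to tell apart."""
    summary = []
    for element in elements:
        if isinstance(element, Table):
            n_cols = len(element.rows[0]) if element.rows else 0
            summary.append(f"table {len(element.rows)}x{n_cols}")
        else:
            summary.append(f"text: {element.text}")
    return summary


# Made-up sample data standing in for one extracted page.
page = [
    TextLine("Quarterly results"),
    Table(rows=[["Region", "Sales"], ["North", "120"], ["South", "95"]]),
]
```

A caller would typically loop over the list, test the type of each element, and index into the tables by row and column.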

Code description

With version 3.0 of the library, a more appropriate strategy is followed for determining table cell areas.

A form of edge detection is applied, specialised for perfectly horizontal and vertical lines.

After the table edges are obtained, they are processed further to locate each cell, and then the text contained in each cell is extracted.
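The last step, collecting the text that belongs to each cell once the cell locations are known, can be sketched as follows. This is a minimal Python illustration under stated assumptions: cells are axis-aligned rectangles, each word is represented by its centre point, and the names Rect, Word, contains and text_per_cell are hypothetical rather than part of the library.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)
Word = Tuple[str, float, float]           # (text, x, y) of the word's centre


def contains(rect: Rect, x: float, y: float) -> bool:
    """True when the point lies inside the rectangle."""
    left, top, right, bottom = rect
    return left <= x < right and top <= y < bottom


def text_per_cell(cells: List[Rect], words: List[Word]) -> Dict[Rect, str]:
    """Assign every word to the first cell that contains its centre."""
    collected: Dict[Rect, List[str]] = {cell: [] for cell in cells}
    # Visit words top to bottom, then left to right, to keep reading order.
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        for cell in cells:
            if contains(cell, x, y):
                collected[cell].append(text)
                break
    return {cell: " ".join(parts) for cell, parts in collected.items()}
```

Words whose centre falls outside every cell are simply ignored here; a real extractor would presumably hand them to the paragraph handling instead.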

This is expected to be the final approach; the remaining work is to find and collect the cases it does not yet handle and to add proper support for them.

Windows

PDF table extractor v1.0 (2024)

Download

PDF table extractor v2.0 (2024-2025)

Download

PDF table extractor v3.0 (2025)

Download

Versions

Reusing the classes written for ChessPdfBrowser, an application that scans PDFs and extracts chess games from them, I created a beta version of a library for extracting text from PDFs, including tabular elements.

The library scans the specified pages and extracts their text. While doing so, it searches for tabular patterns and returns them as rectangular arrays.
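How the tabular patterns are recognised is not described here. Purely as an illustration, one simple heuristic is to treat consecutive text lines that split into the same number of columns (separated by two or more spaces) as rows of a table. The Python sketch below implements that assumed heuristic; split_columns and find_table_blocks are hypothetical names, not the library's API.

```python
import re
from typing import List


def split_columns(line: str) -> List[str]:
    """Split a line into columns wherever two or more spaces appear."""
    return [part for part in re.split(r"\s{2,}", line.strip()) if part]


def find_table_blocks(lines: List[str], min_rows: int = 2) -> List[List[List[str]]]:
    """Group consecutive lines with the same column count into tables."""
    blocks = []
    current: List[List[str]] = []
    for line in lines:
        cols = split_columns(line)
        if len(cols) >= 2 and (not current or len(cols) == len(current[0])):
            current.append(cols)
        else:
            if len(current) >= min_rows:
                blocks.append(current)
            # A line with a different column count may start a new block.
            current = [cols] if len(cols) >= 2 else []
    if len(current) >= min_rows:
        blocks.append(current)
    return blocks
```

Each block returned is a rectangular array: a list of rows, all with the same number of columns.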

I hope this will be useful to someone.

I have several PDFs containing tables that I can experiment with.

I've noticed that v1.0 of the library is not very versatile; it works well with some PDFs but not with others.

Version 2.0 of the library introduces multiple settings, arrived at by trial and error with the test PDFs.

Each setting may work well with certain PDFs and poorly with others.

The goal of this version is to run the extraction with every available setting and then use a suitability selector to assemble the best combination of results.
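The idea of trying every setting and keeping the most suitable result can be sketched in Python as below. The scoring rule (reject ragged or single-column results, then prefer the highest fraction of non-empty cells) is an assumption made for the example, and suitability, best_extraction and the two toy extractors are hypothetical names. This simplified form picks one winner; combining results would apply the same choice separately to each table.

```python
from typing import Callable, List, Optional, Sequence

Table = List[List[str]]
Extractor = Callable[[str], Optional[Table]]


def suitability(table: Optional[Table]) -> float:
    """Assumed score: 0 for unusable results, else the share of filled cells."""
    if not table or not table[0]:
        return 0.0
    width = len(table[0])
    if width < 2 or any(len(row) != width for row in table):
        return 0.0
    cells = [cell for row in table for cell in row]
    filled = sum(1 for cell in cells if cell.strip())
    return filled / len(cells)


def best_extraction(source: str, extractors: Sequence[Extractor]) -> Optional[Table]:
    """Run every extractor and keep the candidate with the highest score."""
    candidates = [extract(source) for extract in extractors]
    best = max(candidates, key=suitability, default=None)
    return best if suitability(best) > 0 else None


# Two toy "settings" that stand in for real extraction configurations.
def by_comma(text: str) -> Table:
    return [line.split(",") for line in text.splitlines()]


def by_semicolon(text: str) -> Table:
    return [line.split(";") for line in text.splitlines()]
```

When no candidate scores above zero, the function returns None, which corresponds to the case where none of the settings suits the document.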

This doesn't always result in a perfect extraction, but it can be a good start.

If none of the settings produces a good table extraction, don't hesitate to contact me about the possibility of adding a new setting that works with your table.

Version 3.0 of the library aims to improve table extraction.

The improvement in this version is to detect table edges before doing any processing on the text, so that the text of each cell can be extracted with its location already known.

The edges are extracted by applying basic correlations of perfectly horizontal and vertical lines, and with a little extra processing, complete table edges can be extracted.
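The exact correlation used is not specified, and neither is the representation the library works on. As a minimal sketch, assume the page is available as a binary raster in which 1 marks a dark pixel. Finding runs of at least min_len consecutive dark pixels along a row or column is then equivalent to correlating with an all-ones kernel of that length and keeping only the positions with the maximum response. The names runs, horizontal_lines and vertical_lines are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]         # 1 = dark pixel, 0 = background
Segment = Tuple[int, int, int]  # (fixed coordinate, start, end), end inclusive


def runs(values: List[int], min_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) of every run of 1s that is at least min_len long."""
    found = []
    start = None
    for i, value in enumerate(values + [0]):  # the sentinel closes a trailing run
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_len:
                found.append((start, i - 1))
            start = None
    return found


def horizontal_lines(image: Image, min_len: int) -> List[Segment]:
    """Segments as (y, x_start, x_end) for each long enough horizontal run."""
    return [(y, s, e) for y, row in enumerate(image) for s, e in runs(row, min_len)]


def vertical_lines(image: Image, min_len: int) -> List[Segment]:
    """Segments as (x, y_start, y_end) for each long enough vertical run."""
    columns = [list(col) for col in zip(*image)]
    return [(x, s, e) for x, col in enumerate(columns) for s, e in runs(col, min_len)]
```

The min_len threshold keeps glyph strokes and other short marks from being mistaken for table rules; the extra processing mentioned above (for example, joining broken segments) is left out of the sketch.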

Once the edges have been obtained, a graph is generated with the immediate connections of each vertex, and by traversing this graph, the areas of the table cells can be recovered.
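The graph step can be illustrated with a simplified Python sketch. It assumes the edges are given as horizontal segments (y, x_start, x_end) and vertical segments (x, y_start, y_end), takes their intersections as vertices, links each vertex to its immediate neighbour to the right and below, and reports a cell wherever the two paths around a corner meet at the same opposite corner. The names build_graph and cell_areas are hypothetical, and merged cells whose borders lack intermediate vertices are not handled.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]           # (x, y)
HSeg = Tuple[int, int, int]       # (y, x_start, x_end)
VSeg = Tuple[int, int, int]       # (x, y_start, y_end)
Rect = Tuple[int, int, int, int]  # (left, top, right, bottom)
Graph = Dict[Point, Dict[str, Point]]


def build_graph(h_segs: List[HSeg], v_segs: List[VSeg]) -> Graph:
    """Map each vertex to its immediate 'right' and 'down' neighbours."""
    vertices = {
        (x, y)
        for y, x0, x1 in h_segs
        for x, y0, y1 in v_segs
        if x0 <= x <= x1 and y0 <= y <= y1
    }
    graph: Graph = {v: {} for v in vertices}
    for y, x0, x1 in h_segs:
        on_line = sorted(p for p in vertices if p[1] == y and x0 <= p[0] <= x1)
        for a, b in zip(on_line, on_line[1:]):
            graph[a]["right"] = b
    for x, y0, y1 in v_segs:
        on_line = sorted(p for p in vertices if p[0] == x and y0 <= p[1] <= y1)
        for a, b in zip(on_line, on_line[1:]):
            graph[a]["down"] = b
    return graph


def cell_areas(graph: Graph) -> List[Rect]:
    """Walk right-then-down and down-then-right from each top-left corner."""
    cells = []
    for corner in sorted(graph, key=lambda p: (p[1], p[0])):
        right = graph[corner].get("right")
        down = graph[corner].get("down")
        if right is None or down is None:
            continue
        opposite = graph[right].get("down")
        # The rectangle closes only if both paths reach the same vertex.
        if opposite is not None and opposite == graph[down].get("right"):
            cells.append((corner[0], corner[1], right[0], down[1]))
    return cells
```

Cells come out ordered top to bottom and left to right, which is convenient for arranging them into rows afterwards.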


The library returns an ordered mix of tables and paragraphs that aren't in any table, trying to respect the order in the PDF layout.
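One way to picture this ordering is to give every table and paragraph a position on the page and sort on it. The Python sketch below is an assumption about how such an ordering could be produced, not the library's method: Block, reading_order and the column_starts parameter are hypothetical, and the column boundaries are supplied by the caller rather than inferred.

```python
import bisect
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Block:
    kind: str      # "paragraph" or "table"
    top: float     # distance from the top of the page
    left: float    # distance from the left edge of the page
    content: object


def reading_order(blocks: List[Block],
                  column_starts: Sequence[float] = (0.0,)) -> List[Block]:
    """Sort blocks column by column, then top to bottom within a column."""
    def key(block: Block):
        # Index of the last column whose start is at or before the block.
        column = bisect.bisect_right(column_starts, block.left) - 1
        return (column, block.top, block.left)
    return sorted(blocks, key=key)
```

With the default single column the blocks are simply ordered from top to bottom; passing the left edge of each column, in ascending order, reads the first column in full before moving to the next.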

The parser can theoretically detect layouts in one or more columns, or a particular combination, which the paragraph parser will infer with a bit of luck.

This should happen without any extra intervention in the calls, simply by using the default constructors.

The other parser constructors take configuration objects with many parameters, so if the parser doesn't work perfectly with your PDF, it's quite possible that you can fix it "simply" by tweaking that configuration object.

That is a difficult task if you're not the library developer, so I'm willing to try tweaking the configuration myself if the library doesn't work perfectly with your PDF (frojasg1@hotmail.com).

Downloads