
For a list of BASHing data 2 blog posts see the index page.


Table in a PDF to a TSV, on the command line

My wife received a data table from someone who had built the table in Microsoft Word, then exported the document as a PDF. How to convert the table (and not the rest of the document) to a TSV?

This isn't easy to do on the command line, and one of the many online "PDF to [CSV/TSV/spreadsheet]" services might be a good first choice. However, the following method will often work.

To demonstrate the method, I built the "demo.odt" file shown in the screenshot below in LibreOffice Writer. I then exported the file as "demo.pdf".

[Screenshot: "demo.odt" in LibreOffice Writer]

If I copy the text of "demo.pdf" to the clipboard and xclip it into a terminal, the tab spacing between columns is replaced with a single space. This obscures the column boundaries, because entries in the first and second columns contain space-separated words:

[Screenshot: the pasted text in a terminal]

One solution is to use the pdftotext utility from the "poppler-utils" package. With the -layout option, Table 2 is returned with its columns separated by runs of spaces (not tabs), although Table 1 still has the "single-space" problem between columns 1 and 2:

[Screenshot: first pass with pdftotext -layout on "demo.pdf"]

spacevis is a function that makes spaces visible, showing each one as a middle dot on a yellow background:
 
spacevis() { sed 's|\x20|\x1b[103m\xc2\xb7\x1b[0m|g'; }
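As a quick check of what spacevis does, here is a small sketch that pipes an invented sample line through the function. The sample text is made up for illustration, and the \x escapes assume GNU sed and a terminal that understands ANSI colors and UTF-8.

```shell
# spacevis as defined in the post: each space becomes a middle dot
# (UTF-8 bytes C2 B7) on a bright yellow background (ANSI code 103),
# followed by a reset (ANSI code 0).
spacevis() { sed 's|\x20|\x1b[103m\xc2\xb7\x1b[0m|g'; }

# Invented sample line: one space inside "Client name", four before "Amount".
printf 'Client name    Amount\n' | spacevis

# With a real file the function would sit at the end of the pipeline
# (commented out here because it needs pdftotext and "demo.pdf"):
# pdftotext -layout demo.pdf - | spacevis
```

A single highlighted dot marks a space inside a column entry, while a run of dots marks a gap between columns.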

With Table 2, at least, I can now use shell tools to get a tidy TSV:

[Screenshot: Table 2 as a TSV]

pdftotext -layout demo.pdf - | awk '(/^Client/ && ++c==2),0 {gsub(/[ ]{2,}/,"\t"); print}' | sed '/\f/d'

This is one of several ways you could isolate Table 2 in the pdftotext output and replace multiple spaces with tabs. The AWK range pattern (/^Client/ && ++c==2),0 selects the lines from the second line that begins with "Client" to the end of the file: the counter c is incremented each time a line begins with "Client", the range opens when c reaches 2, and the end condition 0 is never true, so the range never closes. For those lines, AWK's gsub replaces every run of 2 or more consecutive spaces with a tab, and the line is then printed. The final command sed '/\f/d' deletes the line containing the form feed control character with which pdftotext finishes its conversion.
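To follow the logic without a PDF, the sketch below feeds an invented stand-in for the pdftotext output through the same AWK and sed commands. The function names table2_to_tsv and sample_layout and all the sample rows are hypothetical, not the contents of "demo.pdf". The sketch also assumes an AWK that supports interval expressions like {2,} (GNU AWK does) and GNU sed for the \f escape.

```shell
# Hypothetical helper wrapping the AWK and sed stages from the post.
table2_to_tsv() {
    awk '(/^Client/ && ++c==2),0 {gsub(/[ ]{2,}/,"\t"); print}' | sed '/\f/d'
}

# Invented stand-in for "pdftotext -layout" output: two tables that each
# start with a "Client" header line, then a final form feed.
sample_layout() {
    printf 'Client name Amount\n'
    printf 'Acme Pty Ltd 100\n'
    printf '\n'
    printf 'Client name      Amount\n'
    printf 'Acme Pty Ltd     100\n'
    printf 'Smith and Sons   250\n'
    printf '\f'
}

# Only the second table should come through, without the form feed line.
sample_layout | table2_to_tsv
```

The first "Client" line only bumps the counter to 1, so nothing is printed until the second one. Single spaces inside entries such as "Acme Pty Ltd" are left alone because the regex needs at least two spaces in a row.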
 
The parentheses in (/^Client/ && ++c==2),0 aren't necessary and are there for clarity. See this BASHing data post.
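The point about the parentheses can be checked with a small sketch that runs the range pattern both ways on invented input and compares the results. It relies on the comma of an AWK range pattern binding more loosely than && and ==. The function names and sample lines are hypothetical.

```shell
# Invented input: two blocks that each begin with a "Client" line.
sample_lines() { printf '%s\n' 'Client A' 'row 1' 'Client B' 'row 2' 'row 3'; }

# Range pattern with the parentheses used in the post.
with_parens() { awk '(/^Client/ && ++c==2),0'; }

# The same range pattern without the parentheses.
without_parens() { awk '/^Client/ && ++c==2,0'; }

# Both forms should select the second "Client" line and everything after it.
if [ "$(sample_lines | with_parens)" = "$(sample_lines | without_parens)" ]; then
    echo "same selection"
else
    echo "different selection"
fi
```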

So the pdftotext method works well with tab-separated tables in a PDF, but not with table objects. I haven't yet puzzled out how to get those table objects into TSVs!


Last update: 2024-03-29
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License