Sharing data and metadata together

Metadata is data about data. For a data table, the metadata might include information about what the table is, who compiled it, what its sources are, what its field names mean and what its abbreviations stand for. The table's metadata normally stand outside the rows and columns of the table itself, so how can you keep a table and its metadata together, but still make the table data available for re-use?

A common strategy is to have two copies of the table. One copy appears together with its metadata in a larger, shareable file, such as a word-processing document. The second copy is in a separate, linked file: "Click here to download this table as a CSV".

When there's only one copy of a table, the availability of its data depends a lot on how the table+metadata are presented, or rather, on the kind of file that contains them.

If the table+metadata are in a word-processing document, or in a PDF made from that document, there's a problem. For example, here's a screenshot of "age_table.odt", with text from this Wikipedia page:


You can't copy-paste the table into a text file as a table. Pasting a copy gives you each data item on a new line, as seen in this screenshot from a text editor:


The paste could be re-built as a CSV (for example) on the command line. In this case
paste -d"," - - - < pasted_table
will do the job nicely.
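
As a minimal sketch of that rebuild, assuming a three-column table: the names and ages below are made-up sample data, and "rebuilt_table.csv" is a hypothetical output file name. The options are placed before the three "-" operands so that the command should also work with non-GNU versions of paste.

```shell
# Hypothetical sample: a 3-column table pasted as one data item per line
printf '%s\n' 'First name' 'Last name' 'Age' \
              'Ann' 'Smith' '34' \
              'Bob' 'Jones' '27' > pasted_table

# Each "-" reads one line from stdin, so three of them join every
# three consecutive lines into one comma-separated row
paste -d"," - - - < pasted_table > rebuilt_table.csv

cat rebuilt_table.csv
```

This only gives a valid CSV if no data item contains a comma or a line break. If any do, a tab delimiter (paste's default) is probably the safer choice.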

For the recipients of a document like "age_table.odt" who want the table data for re-use, a 2-step method is to copy the table and paste it into a spreadsheet. The table-in-a-spreadsheet can then be copy-pasted into a text editor as a TSV, or saved in another plain-text format.

A second approach is to add the metadata in "explainer rows" in a spreadsheet containing the data table; see here for an example. The table data can then be isolated for further use by the copy-paste method, or by copying the spreadsheet and deleting the metadata in the copy. I've also seen spreadsheet workbooks containing various data tables on separate worksheets, and with one metadata-only worksheet explaining all the tables. I suppose technically the data and the metadata are in the same document in that case, but using that document may not be easy.
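
If a spreadsheet like that is exported as plain text, the explainer rows can also be split off on the command line. The following is only a sketch under assumptions: the file names are hypothetical, the sample content is made up, and the metadata is assumed to occupy exactly the first three rows, which has to be checked for each real file.

```shell
# Hypothetical TSV export: three explainer rows, then header row and data
printf 'Title: Age table\nSource: made-up sample\nAge = age in years\n' > table_with_metadata.tsv
printf 'First name\tLast name\tAge\nAnn\tSmith\t34\n' >> table_with_metadata.tsv

# Keep everything from line 4 onward (header row plus data rows)
tail -n +4 table_with_metadata.tsv > table_only.tsv

# Keep the three explainer rows in a separate file
head -n 3 table_with_metadata.tsv > metadata_only.txt
```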

A little-used alternative to a word-processing document or a spreadsheet is an HTML file. HTML is plain text and doesn't need a word-processing or spreadsheet application for display. An HTML file containing a table and its metadata will have a tiny file size and can be viewed in any browser, and its table data can be scraped to TSV with a simple copy/paste.
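
Copy/paste from the browser is the easy route, but for very simple markup the table can also be pulled out on the command line. This sketch assumes a hypothetical file "simple_table.html" in which every table row sits on its own line and the cell tags carry no attributes. It is not a general HTML parser, and the sample data is made up.

```shell
# Hypothetical HTML file: metadata in a <p>, one table row per line
cat > simple_table.html <<'EOF'
<!DOCTYPE html>
<html lang="en">
<title>Simple table</title>
<p>Ages are in years.</p>
<table>
<tr><th>First name</th><th>Last name</th><th>Age</th></tr>
<tr><td>Ann</td><td>Smith</td><td>34</td></tr>
<tr><td>Bob</td><td>Jones</td><td>27</td></tr>
</table>
</html>
EOF

# A literal tab character, built portably
tab=$(printf '\t')

# Keep only the row lines, turn cell boundaries into tabs,
# then delete whatever tags are left
grep '^<tr>' simple_table.html |
  sed -e "s|</t[hd]><t[hd]>|${tab}|g" -e 's|<[^>]*>||g' > simple_table.tsv

cat simple_table.tsv
```

Markup that spreads a row over several lines, or adds attributes to the cells, would need a real HTML parser instead.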

I suspect HTML isn't used much for this purpose because so much reportage and documentation is organised in numbered pages suited to printing on one or both sides of paper sheets. This is true even when almost all users will be viewing the content on a screen, and keeping the file in digital form. Documents in HTML don't need page numbering, because tables of contents and other internal pointers can be hyperlinked. Users of GNU software manuals are offered another choice, as here: view the whole manual as a single webpage, or as one webpage per section.

Another obstacle to HTML use may be that in many applications the "save as HTML" and "export as HTML" options try too hard. The applications (and online "odt-to-html" and "docx-to-html" services) parse and convert the original document almost line by line, resulting in horribly complicated markup.

If I save "age_table.odt" (above) as HTML with LibreOffice Writer 6, I get HTML 4.0 Transitional page code (vintage 1998) with a peculiar styling header and inefficient in-line styling. Here's a screenshot of the last table row in that Writer-built HTML file:


In a browser, the Writer HTML isn't so wonderful a copy of the original, either:


And here's my simpler version - not an exact copy of the ODT file, but a browser-friendly replacement:

<!DOCTYPE html>
<html lang="en">
<title>Simple table</title>
<meta charset="UTF-8">
<style>
th,td {border:1px solid gray;padding:5px;}
th {background-color:#5983B0;color:white;}
table {border-collapse:collapse;}
p {font-size:90%;}
</style>
<p style="font-size:120%"><strong><em>Simple table</em></strong></p>
<p>The following illustrates a simple table with three columns and nine rows.</p>
<p style="color:red"><strong>Age Table</strong></p>
<table>
<tr><th>First name</th><th>Last name</th><th>Age</th></tr>
<!-- the nine data rows, one tr element per row, go here -->
</table>
<p>The first row is not counted, because it is only used to display the column names.<br>
This is called a "header row".</p>
</html>

Screenshot from my browser:


Consider putting your tables+metadata in HTML files, either as the principal documents or as separates (with a "Click here for a Web version of this table and its metadata" note), rather than in word-processing documents or spreadsheets. The HTML versions will look good and extracting the table data (copy, paste as TSV) will be easy.

Last update: 2020-07-29
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License