For a full list of BASHing data blog posts see the index page.

Two data formatting tweaks (updated)

Nearly all the data files I audit are tab-separated plain text (TSV). Tabs are wonderful field separators but they're invisible, and sometimes it helps to see where one field ends and another begins. This post describes a couple of the methods I use to make TSVs more eye-friendly.

barsep. The first tweak is a sed command that simply replaces all tab characters with [space][pipe][space]:

alias barsep="sed 's/\t/ | /g'"

barsep is an afterthought. In other words, I'll use a command that generates a tab-separated output, then think "Hmm, that would look clearer if it was [space][pipe][space]-separated", particularly if the output is going to be copy-pasted into a document. So I repeat the command and pipe the output to barsep, as shown above.

Update. The "one2manyh" and "one2manyLLh" functions described below have been modified in A Data Cleaner's Cookbook so that sets of similar items are separated by a blank line. The modifications are shown and explained here.

one2manyh. The second tweak is based on the one2many function, which I use a lot in auditing. one2many scans a TSV looking for records in which unique values in the "one" field have more than one corresponding value in the "many" field. It returns the number of records for each "one" to "many" pair.

For example, in the table "oc" the "scientific name" field (field 14) contains species names, and each species should have one and only one corresponding family name in the "family" field (field 19). But one2many shows that one record for Plectrophenax nivalis has the incorrect family "Locustellidae", while 18 records have the correct "Calcariidae":

I might know today what fields 14 and 19 are, but I'll forget tomorrow, and copy-pasting results like that into an audit report means that a header line with field names would be useful. Here's the basic one2many, explained in this BASHing data post:

one2many() {
awk -F"\t" -v one="$2" -v many="$3"
'$one != "" {a[$one]++; b[$one][$many]++}
END {for (i in b) {for (j in b[i])
{if (a[i]>b[i][j]) print b[i][j] FS i FS j}}}' "$1"
}

The tweak has AWK print "No. of records" and the two field names from the header line of the table (NR==1) before checking for one-to-many relationships, and the AWK output is fed to barsep's command. The tweaked function is called one2manyh:

one2manyh() {
awk -F"\t" -v one="$2" -v many="$3"
'NR==1 {print "No. of records" FS $one FS $many; next}
$one != "" {a[$one]++; b[$one][$many]++}
END {for (i in b) {for (j in b[i])
{if (a[i]>b[i][j]) print b[i][j] FS i FS j}}}' "$1"
| sed 's/\t/ | /g'
}

one2manyLL is a variant of one2many in which the "one" field is actually a pair of fields, like a latitude field and a longitude field. The tweaked version of one2manyLL (one2manyLLh) has the added line shown below; see above for where it fits in.

NR==1 {print "No. of records" FS $lat","$lon FS $verbal; next}

Here's one2manyLLh at work on another TSV called "oc", with fields "decimalLatitude" (27), "decimalLongitude" (28) and "locality" (24):

Last update: 2022-02-20
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License