For a full list of BASHing data blog posts see the index page.  RSS

How to do replacements based on multiple field values

In a previous BASHing data post I explained how to normalise entries in a field based on the entry in another field. The same command-line method can be used to repair entries based on entries in several other fields in the same record. An example will make it a lot easier to see what this is all about and why this method is so useful.

The TSV "sightings" contains 431030 bird-watching observations. For demonstration purposes I've deleted all the fields except serial ID number, the Australian state or territory in which the observation was made, and the latitude and longitude of the observation:


There are a lot of duplicate locations in this file: same place but different bird, different day, different time or different observer. But a one2manyLL check shows that some lat/lons have been assigned to the wrong state:


If there aren't many errors and the file isn't huge, data problems like these can be located and fixed by hand in a text editor. However, there's a command-line way to do the fixing that works for any number of errors and any size files, and works a lot faster. First, using a text editor I'll reformat the one2manyLL output to make a TSV ("fix") with the correct combinations:


Update 2022-02-20. The "one2many" and "on2manyLL" functions now add a blank line between similar sets of data items. In building "fix", these unneeded blank lines can be deleted, on the command line with awk NF.

If I put "fix" in an AWK array, I can then process "sightings" to find records with matching lat/lons. In those records I replace the "state" entry with the correct value, including in those records which already have the correct value. (Explanation below.)

awk 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1$2]=$3; next} \
$3$4 in a {$2=a[$3$4]} 1' fix sightings > sightings_fixed


(The tally function is described here.)
In a BEGIN statement, AWK is told that both the input field separator (FS) and the output field separator (OFS) is a tab. Specifying the OFS is necessary because field 2 in "sightings" will be rebuilt.
The FNR==NR trick is used to separate the processing of "fix" and "sightings". In processing "fix" first, AWK builds an array "a" whose index string is a concatenation of the lat/lon fields 1 and 2 (a[$1$2]=$3).
Having finished with "fix", AWK moves on to "sightings". Here it looks for records in which a concatenation of the lat/lon fields 3 and 4 appears as an index string in "a" ($3$4 in a). If so, AWK redefines the "state" entry in field 2 as the value string from the array ($2=a[$3$4]).
The final pattern "1" is always true, so AWK does its default action, which is to print the line. It does this for all lines, including the 14 lines modified in the second command.
AFAIK, this concatenation-of-multiple-fields method will work with any number of fields.

Last update: 2022-02-20
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License