For a full list of BASHing data blog posts see the index page.

Comparing strings more clearly

In a recent data audit, field 19 of a TSV contained a scientific name, and field 20 contained another version of the name plus the scientific authority for that name. In most cases the two name versions were the same, like this:

Anadyomene stellata Anadyomene stellata (Wulfen) C.Agardh

In other cases the two versions weren't the same. Sometimes the species or subspecies names differed, sometimes the genus names and sometimes both:

Ceramium fastigiatum      Ceramium cimbricum H.E.Petersen
Codium fragile subsp. tomentosoides
      Codium fragile subsp. fragile (Suringar) Hariot
Boergeseniella thuyoides      Vertebrata thuyoides (Harvey) Kuntze
Acrosorium uncinatum      Cryptopleura ramosa (Hudson) L.Newton

I used AWK to select out the field 19/field 20 pairs where the names differed. To demonstrate this command I'll use a simplified TSV called "demo", with fake scientific names:

ID	Name	Alt_name+authority
001	Primium vulgare	Secundum vulgare Müller
002	Primium	Primium De Blas
003	Trivius latum scotensis	Trivius latus scotensis Baker
004	Primia	Primium De Blas
005	Secundum vulgare	Secundum vulgare Müller
006	Primia vulgaris	Secundum vulgare Müller
007	Trivius latus scotensis	Trivius latus scotensis Baker
008	Primium latum scotense	Trivius latus scotensis Baker

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) if (a[i] != b[i]) {print; next}}' demo

Core parts of this command are explained below.

AWK users may be wondering why I didn't just test the longer field to see if it matched the shorter field, like this: $3 !~ $2. That's OK for strings with only alphanumeric charcters, but the matching will throw fatal errors if the shorter string contains something that looks like an invalid range expression or an incompletely round-bracketed one. It will also return a non-match if the shorter field contains "*". The command I used is longer but safer.

This worked fine, but it didn't tell me which of the names were different. A bit of tinkering with AWK led me to a nice couple of solutions. The first method selects the lines with name changes and colorises the "before" and "after" words:

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) {if (a[i] != b[i]) \
{sub(a[i],"\033[1;31m"a[i]"\033[0m",$2); \
sub(b[i],"\033[1;31m"b[i]"\033[0m",$3)}}} /\033/' \
OFS=" | " demo

There are two condition-action parts to the command. The main one (inside the 3 nested curly brackets) has NR>1 its condition, so it only applies to lines after the header.

The main action begins by splitting fields 2 and 3 into arrays "a" and "b", respectively. Each array element will contain a space-separated word. In addition, the total number of words in field 2 is counted and stored in the variable "n". (Besides splitting strings, split also returns the number of array elements it creates.)

The main action now starts a for loop which iterates the variable "i" through "n" (for (i=1;i<=n;i++)). In other words, what happens next will only happen for the words in field 2, and any additional words in field 3 will be ignored.

For each word in field 2, AWK checks to see if it differs from its corresponding word in field 3 (if (a[i] != b[i])). If that's true, then AWK substitutes a colorised version of the word, in both fields. The substitution is done with the sub function and uses ANSI color escapes for the colorising (e.g. sub(a[i],"\033[1;31m"a[i]"\033[0m",$2))

With all those jobs done, AWK turns to the second condition (/\033/). If the escape character is found in the line, the line is printed (default action).

As a final touch for clarity, AWK is told with OFS=" | " as a "pseudo-argument" that the output field separator is space, pipe, space.

This first solution shows me the differences between names if I happen to be in a terminal, but it doesn't produce something I can store in a text file. The second solution does that job:

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) if (a[i] != b[i]) \
f = (!f) ? a[i]"|"b[i] : f", "a[i]"|"b[i]} \
f {print $0 "\n " f; f=""}' demo

This command begins like the first one, but changes the action to be done if the words are different. The action is to define a variable "f" with a ternary "if, do this, else, do that" test. If "f" doesn't exist yet, it's defined as the field 2 word separated from its field 3 version by a pipe. If "f" does exist, it's defined as the existing "f" followed by a comma and a space, then the latest two versions separated by a pipe. That closes the action.

The second condition/action begins with the condition "f", which means "if f exists", and "f" only exists for those lines in which names differ between fields 2 and 3. The action to be taken is simply to print the whole line, a newline and a couple of spaces, then "f". After the printing, "f" is reset to an empty string in preparation for processing the next line in the file.

Last update: 2020-12-09
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License