Comparing strings more clearly

In a recent data audit, field 19 of a TSV contained a scientific name, and field 20 contained another version of the name plus the scientific authority for that name. In most cases the two name versions were the same, like this:

Anadyomene stellata    Anadyomene stellata (Wulfen) C.Agardh

In other cases the two versions weren't the same. Sometimes the species or subspecies names differed, sometimes the genus names and sometimes both:

Ceramium fastigiatum                   Ceramium cimbricum H.E.Petersen
Codium fragile subsp. tomentosoides    Codium fragile subsp. fragile (Suringar) Hariot
Boergeseniella thuyoides               Vertebrata thuyoides (Harvey) Kuntze
Acrosorium uncinatum                   Cryptopleura ramosa (Hudson) L.Newton

I used AWK to select the field 19/field 20 pairs where the names differed. To demonstrate the command I'll use a simplified TSV called "demo", with fake scientific names:

ID     Name                       Alt_name+authority
001    Primium vulgare            Secundum vulgare Müller
002    Primium                    Primium De Blas
003    Trivius latum scotensis    Trivius latus scotensis Baker
004    Primia                     Primium De Blas
005    Secundum vulgare           Secundum vulgare Müller
006    Primia vulgaris            Secundum vulgare Müller
007    Trivius latus scotensis    Trivius latus scotensis Baker
008    Primium latum scotense     Trivius latus scotensis Baker

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) if (a[i] != b[i]) {print; next}}' demo
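For anyone who wants to try the command, here is a minimal sketch that rebuilds the "demo" file with printf and runs a spaced-out, commented version of the same logic. The two fields are split into words and compared position by position, but only for as many words as field 2 has, so the authority at the end of field 3 is ignored. The shell variable "selected" is my own addition, used only to hold the result.

```shell
# Recreate the demo table as a tab-separated file (header plus 8 records).
# printf reuses the format for every group of three arguments.
printf '%s\t%s\t%s\n' \
    'ID'  'Name'                    'Alt_name+authority' \
    '001' 'Primium vulgare'         'Secundum vulgare Müller' \
    '002' 'Primium'                 'Primium De Blas' \
    '003' 'Trivius latum scotensis' 'Trivius latus scotensis Baker' \
    '004' 'Primia'                  'Primium De Blas' \
    '005' 'Secundum vulgare'        'Secundum vulgare Müller' \
    '006' 'Primia vulgaris'         'Secundum vulgare Müller' \
    '007' 'Trivius latus scotensis' 'Trivius latus scotensis Baker' \
    '008' 'Primium latum scotense'  'Trivius latus scotensis Baker' \
    > demo

# Same selection as the one-liner, laid out one step per line.
selected=$(awk -F"\t" '
NR > 1 {
    n = split($2, a, " ")       # words in the name
    split($3, b, " ")           # words in the alternative name plus authority
    for (i = 1; i <= n; i++)
        if (a[i] != b[i]) {     # the first mismatch is enough
            print
            next
        }
}' demo)
printf '%s\n' "$selected"
```

On the demo table this should print records 001, 003, 004, 006 and 008, and skip 002, 005 and 007, where every word of the name matches its partner.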

This worked fine, but it didn't tell me which of the names were different. A bit of tinkering with AWK led me to a couple of nice solutions. The first method selects the lines with name changes and colorises the "before" and "after" words:

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) {if (a[i] != b[i]) \
{sub(a[i],"\033[1;31m"a[i]"\033[0m",$2); \
sub(b[i],"\033[1;31m"b[i]"\033[0m",$3)}}} /\033/' \
OFS=" | " demo
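One thing to watch with sub(): it treats its first argument as a regular expression and replaces the leftmost match in the field. As far as I can tell, that means a pair like "Codium fragile subsp. tomentosoides" and "Codium fragile subsp. fragile" would get the first "fragile" in field 3 colorised, not the fourth word, which is the one that actually differs. The sketch below is an alternative of my own, not the command above: it rebuilds each field word by word and colorises by position. The ID 009 and the names "input", "colorised", "red", "off", "left" and "right" are invented for the illustration.

```shell
# One header line plus one record, tab-separated (names from the audit examples).
input=$(printf '%s\t%s\t%s\n' \
    'ID'  'Name' 'Alt_name+authority' \
    '009' 'Codium fragile subsp. tomentosoides' \
          'Codium fragile subsp. fragile (Suringar) Hariot')

colorised=$(printf '%s\n' "$input" | awk -F"\t" -v OFS=" | " '
BEGIN { red = "\033[1;31m"; off = "\033[0m" }
NR > 1 {
    n = split($2, a, " "); m = split($3, b, " ")
    left = ""; right = ""; changed = 0
    # Rebuild field 2, colorising each word that differs from its partner.
    sep = ""
    for (i = 1; i <= n; i++) {
        w = a[i]
        if (a[i] != b[i]) { w = red a[i] off; changed = 1 }
        left = left sep w; sep = " "
    }
    # Rebuild field 3 the same way; words past position n are the authority.
    sep = ""
    for (i = 1; i <= m; i++) {
        w = b[i]
        if (i <= n && a[i] != b[i]) w = red b[i] off
        right = right sep w; sep = " "
    }
    if (changed) print $1, left, right
}')
printf '%s\n' "$colorised"
```

Because no regular expression is involved, the dot in "subsp." is also treated as a plain character. The cost is that any runs of spaces inside a field are collapsed to single spaces when the field is rebuilt.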

This first solution shows me the differences between names if I happen to be in a terminal, but it doesn't produce something I can store in a text file. The second solution does that job:

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) if (a[i] != b[i]) \
f = (!f) ? a[i]"|"b[i] : f", "a[i]"|"b[i]} \
f {print $0 "\n" f; f=""}' demo

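Here is one way the result could be kept, sketched under the assumption that the differences should travel with each record as an extra tab-separated field and be saved to a file. The output file name "name_diffs.tsv", the "Differences" header and the variable "diffs" are made up for this example.

```shell
# Recreate the demo table as a tab-separated file (header plus 8 records).
printf '%s\t%s\t%s\n' \
    'ID'  'Name'                    'Alt_name+authority' \
    '001' 'Primium vulgare'         'Secundum vulgare Müller' \
    '002' 'Primium'                 'Primium De Blas' \
    '003' 'Trivius latum scotensis' 'Trivius latus scotensis Baker' \
    '004' 'Primia'                  'Primium De Blas' \
    '005' 'Secundum vulgare'        'Secundum vulgare Müller' \
    '006' 'Primia vulgaris'         'Secundum vulgare Müller' \
    '007' 'Trivius latus scotensis' 'Trivius latus scotensis Baker' \
    '008' 'Primium latum scotense'  'Trivius latus scotensis Baker' \
    > demo

awk -F"\t" -v OFS="\t" '
NR == 1 { print $0, "Differences"; next }    # hypothetical header for the new field
{
    n = split($2, a, " "); split($3, b, " ")
    diffs = ""
    # Collect every before|after pair, comma-separated.
    for (i = 1; i <= n; i++)
        if (a[i] != b[i])
            diffs = (diffs == "" ? "" : diffs ", ") a[i] "|" b[i]
    if (diffs != "") print $0, diffs         # only records with a name change
}' demo > name_diffs.tsv

cat name_diffs.tsv
```

With the demo table, record 006 should pick up the extra field "Primia|Secundum, vulgaris|vulgare", and records 002, 005 and 007 should be left out of the file.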
Coming soon... Later in December 2020 I'll be updating in Zenodo the archived versions of A Data Cleaner's Cookbook and all the latest posts in this blog. From there the archive can be downloaded for offline use. Since all the links between the Cookbook and the blog are local in the archived versions, you can use both resources without needing to go online.

Last update: 2020-12-09

The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License