Date: 2020-04-29

For a full list of BASHing data blog posts see the index page.

Dealing with an all-CAPS/first-CAP jumble

I sometimes need to tally lists of single words in which the same word might appear with only its first letter capitalised or entirely in capital letters. Here's an example, a 20-line list of plant family names that includes some blank lines:

BRASSICACEAE



Caryophyllaceae

Asteraceae



Apocynaceae

APIACEAE

Caesalpiniaceae

APIACEAE

Brassicaceae





Apiaceae

CAMPANULACEAE

Caryophyllaceae



Boraginaceae

BIGNONIACEAE

APIACEAE

CARYOPHYLLACEAE

In the tally I'd like from this list, the all-caps names are changed to first-cap ones and counted together with them:

5 #These are the blank items in the list

4 Apiaceae

1 Apocynaceae

1 Asteraceae

1 Bignoniaceae

1 Boraginaceae

2 Brassicaceae

1 Caesalpiniaceae

1 Campanulaceae

3 Caryophyllaceae

A neat way to build the tally with GNU sed is shown below; the list above is here called "families":

sed 's/./\L&/2g' families | sort | uniqc

#"uniqc" is an alias; see below
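To see what the sed expression does, here is a minimal sketch on a throwaway five-line sample. The temporary file and the "tally" variable are placeholders of mine, and plain "uniq -c" stands in for the "uniqc" alias, so the count column is padded with leading spaces. It assumes GNU sed, since \L is a GNU extension.

```shell
# Throwaway sample with mixed capitalisation and one blank line
sample=$(mktemp)
printf '%s\n' BRASSICACEAE Apiaceae '' APIACEAE Brassicaceae > "$sample"

# GNU sed only: \L lowercases the matched text, and the "2g" flag applies
# the substitution to the 2nd and every later match of "." on a line,
# so the first character of each line is left untouched
sed 's/./\L&/2g' "$sample"

# After normalising, identical names sort together and can be counted;
# "uniq -c" is used here in place of the "uniqc" alias
tally=$(sed 's/./\L&/2g' "$sample" | sort | uniq -c)
printf '%s\n' "$tally"
```

Blank lines have no characters for "." to match, so they pass through unchanged and are counted as one group.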

Here's an AWK alternative to the GNU sed method (it needs GNU AWK, because PROCINFO["sorted_in"] is a gawk feature):

awk 'BEGIN {PROCINFO["sorted_in"]="@ind_str_asc"} {a[tolower($0)]++} END {for (i in a) print a[i]"\t"(toupper(substr(i,1,1))substr(i,2))}' families
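The same counting idea can be laid out over several lines. The sketch below is a variant, not the command above: PROCINFO["sorted_in"] only controls array traversal order in GNU AWK, so this version leaves it out and hands the ordering to sort, which should also work with other AWK implementations. The sample file and the names "count", "name" and "result" are placeholders.

```shell
# Throwaway sample with mixed capitalisation and one blank line
sample=$(mktemp)
printf '%s\n' BRASSICACEAE Apiaceae '' APIACEAE Brassicaceae > "$sample"

result=$(awk '
    # Fold each line to lower case and use it as the array key,
    # so "APIACEAE" and "Apiaceae" increment the same counter
    { count[tolower($0)]++ }

    END {
        # Traversal order is unspecified here, so sorting is done afterwards
        for (name in count)
            # Rebuild the first-cap form: upper-cased first character
            # followed by the rest of the lower-cased key
            printf "%d\t%s%s\n", count[name], toupper(substr(name, 1, 1)), substr(name, 2)
    }
' "$sample" | sort -t "$(printf '\t')" -k2)

printf '%s\n' "$result"
```

The blank line ends up as an empty key, so its row is just a count followed by a tab, and it sorts ahead of the names.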

Although the AWK command is more complicated than the sed one, it's actually a lot faster, even if I restrict the case-changing by sed to lines ending in a capital letter:

sed '/[A-Z]$/s/./\L&/2g' families | sort | uniqc

To test processing times, I'll first concatenate 25,000 copies of the 20-line "families" file, then shuffle the resulting 500,000-line file:

for i in {1..25000}; do cat families >> bigger; done; shuf bigger > jumbo
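For anyone wanting to try this at a smaller scale, the following is a rough sketch under my own assumptions: a four-line stand-in for "families", 1,000 copies instead of 25,000, everything in a temporary directory, and bash's "time" keyword for the measurements. The simplified AWK command only counts and does not restore the first capital. The timings will differ from machine to machine, so none are shown.

```shell
# Scaled-down test files in a temporary directory (all names are placeholders)
workdir=$(mktemp -d)
printf '%s\n' BRASSICACEAE '' Apiaceae APIACEAE > "$workdir/families"

# Redirecting the whole loop once avoids reopening the output file on every pass
for i in {1..1000}; do cat "$workdir/families"; done > "$workdir/bigger"
shuf "$workdir/bigger" > "$workdir/jumbo"

# Shuffling changes the order of the lines, not their number
wc -l < "$workdir/jumbo"

# bash's "time" keyword reports how long each pipeline takes;
# output is discarded so that only the processing is measured
time (sed 's/./\L&/2g' "$workdir/jumbo" | sort | uniq -c > /dev/null)
time (awk '{a[tolower($0)]++} END {for (i in a) print a[i]"\t"i}' "$workdir/jumbo" > /dev/null)
```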

AWK completes the job on "jumbo" in one-fifth to one-quarter the time required by sed, sort and uniq:

