banner

For a list of BASHing data 2 blog posts see the index page.    RSS


Serial numbering based on changing values in another field

My problem was to make up a special, sortable, unique ID field for a table that was based on values in a couple of existing fields. Here's a simplified example (filename "trees"), a CSV of tree seedlings by genus and species, and the numbers of the invoices on which they were sold:

Acacia,melanoxylon,Inv2616
Eucalyptus,globulus,Inv2732
Acacia,melanoxylon,Inv2733
Eucalyptus,pseudoglobulus,Inv2941
Eucalyptus,globulus,Inv2941
Acacia,melanoxylon,Inv2941
Eucalyptus,pseudoglobulus,Inv3067
Acacia,melanoxylon,Inv3067
Eucalyptus,globulus,Inv3128
Acacia,melanoxylon,Inv3128
Eucalyptus,globulus,Inv3531
Acacia,melanoxylon,Inv3531

The ID format I'll choose are the first 3 letters of the genus (lowercased), the first 3 letters of the species, and a serial number with 3 digit places, and I want the file sorted by ID code. In other words, I want this:

acamel001,Acacia,melanoxylon,Inv2616
acamel002,Acacia,melanoxylon,Inv2733
acamel003,Acacia,melanoxylon,Inv2941
acamel004,Acacia,melanoxylon,Inv3067
acamel005,Acacia,melanoxylon,Inv3128
acamel006,Acacia,melanoxylon,Inv3531
eucglo001,Eucalyptus,globulus,Inv2614
eucglo002,Eucalyptus,globulus,Inv2732
eucglo003,Eucalyptus,globulus,Inv2941
eucglo004,Eucalyptus,globulus,Inv3128
eucglo005,Eucalyptus,globulus,Inv3531
eucpse001,Eucalyptus,pseudoglobulus,Inv2614
eucpse002,Eucalyptus,pseudoglobulus,Inv2941
eucpse003,Eucalyptus,pseudoglobulus,Inv3067

I couldn't do this by hand in the "unsimplified" real-world case because there were way too many records. But on the command line the job is simple:

awk -F"," '{printf("%s%s%0.3d,%s\n",substr(tolower($1),0,3),substr($2,0,3),++a[$1,$2],$0)}' trees | sort -V

trees

AWK processes "trees" line by line, and it knows that the field separator is a comma (-F ",").
 
The only AWK action on each line is to printf 4 items.
 
The first item, formatted as an unmodified string (%s), is the first 3 characters from the genus field (field 1), where the whole field 1 entry has been put in lowercase (substr(tolower($1),0,3)).
 
The second item, also formatted as an unmodified string (%s), is the first 3 characters of the species field (field 2), as-is (substr($2,0,3).
 
Item 3 is a running count of the genus-species combinations, stored in the array "a" (++a[$1,$2]), and is formatted as an integer number padded with zeroes to fill 3 spaces (%0.3d).
 
These first 3 items are run together by printf without separation. Item 4 is the whole line preceded by a comma and followed by a newline (,$0\n).
 
The last step is to sort the AWK output using the -V option for GNU sort, which neatly sorts version numbers like "acamel001", "accamel002" etc.


Next post:
2025-12-19   Copy selected items from a terminal to a text file


Last update: 2025-12-12
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License