For a full list of BASHing data blog posts, see the index page.     RSS

Fancy numbering of records

In A Data Cleaner's Cookbook I demonstrate a few of the ways you can give records a serial number using the nl command — for example:

nl demo

With nl alone you can't add a string before a number. To do that, or to embed the serial record numbering within a string, you can use seq with its -f option ("printf" formatting) and paste the result to the list of records:

paste <(seq -f "Record_%03g" 3) nl_file
paste <(seq -f "Record_%03g_of_3" 3) nl_file

seq demo

Rather than serial record numbers, you can add a field with unique "random" numbers, like a UUID. A simple way to do this is with a while loop:

while read line; do echo -e "$(uuidgen)\t$line"; done < nl_file
while read; do echo -e "$(uuidgen)\t$REPLY"; done < nl_file


For a description of the uuidgen command, see its man page.

The loop trick can also be used to put the UUID at the end of the record:

while read line; do echo -e "$line\t$(uuidgen)"; done < nl_file


or in the middle, with a bit of cutting and paste-ing:

while read line; do paste <(cut -f1 <<<"$line") <(echo "$(uuidgen)") <(cut -f2 <<<"$line"); done < nl_file


I find UUIDs are overkill for unique record numbering. For a random string of 10 alphanumeric characters I would use

head -n3 /dev/urandom | tr -cd '[:alnum:]' | cut -c -10

like this:

while read line; do echo -e "$(head -n3 /dev/urandom | tr -cd '[:alnum:]' | cut -c -10)\t$line"; done < nl_file


What if your data table has a header line and you don't want the header numbered? See the "Ignore a header" section of the header page in A Data Cleaner's Cookbook.

Interesting numbering problems arise with irregular multi-line records, as in the file "records" shown below. The records are irregular in this case because they might occupy 2, 3 or 4 lines, although they are always separated with "---":


The fake names and addresses were generated with the rig (random identity generator) command-line program.

How to insert a serial number before each record? In this case I've added a 4-digit number with AWK:

awk 'NR==1 {$0="0001" RS $0} /^---/ {$0=$0 RS sprintf("%04d",(++c)+1)} 1' records


At the first line (NR==1) AWK redefines the whole line ($0) as the string "0001" followed by a record separator (RS; the AWK default is a newline) followed by the original whole line ($0).
Each time AWK finds a line beginning with 3 hyphens (^---) it redefines the line as the whole line ($0) followed by a record separator (RS) followed by the result of a printing command. "sprintf" is used here because I'm defining something, rather than printing it to stdout as "printf" does. What I'm defining is a pre-incremented count of lines beginning with 3 hyphens (++c) plus 1 (because the first record preceded by the "---" will be numbered "0002"), formatted as an integer with zeroes padding out to 4 places (%04d").
All other lines are also printed, thanks to the final "1" in the AWK command.
For an explanation of pre-increment and post-increment counting, and another example of "folding" a count into a printing command, see this earlier BASHing data post.

Last update: 2019-02-17