


Printing repeats within repeats, and splitting a list into columns


Repeats within repeats. BASH printf is a complex piece of machinery. The man page says a printf command should look like printf FORMAT [ARGUMENT]..., which makes it seem the "argument" is the thing to be printed and the "format" describes how.

But it's not that simple. As the BASH manual explains:

The format is a character string which contains three types of objects: plain characters, which are simply copied to standard output, character escape sequences, which are converted and copied to the standard output, and format specifications, each of which causes printing of the next successive argument.

So plain characters in the format string can also be "things to be printed". Here's how that works:

cols1

In the first command, the arguments are the numbers 1, 2 and 3 after the shell expands the brace expression. printf reuses the format string as many times as it takes to consume all the arguments, treating each one as a string (%s), so we get 1[newline]2[newline]3[newline]. The newline escape is in the format string because (see above) the format can include "character escape sequences, which are converted and copied to the standard output".

In the second command I've added "aaa" before the newline escape. Those are "plain characters, which are simply copied to standard output", so they're added after each of the arguments 1, 2 and 3.

In the third command I've put the "aaa" at the head of the format string, so "aaa" is prepended to each of the arguments.
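The screenshot isn't reproduced here, so the following is a sketch of three commands that match the description above. It assumes the brace expression {1..3} and the literal "aaa" mentioned in the text; the exact commands in the original screenshot may have differed.

```shell
# The format is reused once per argument: prints 1, 2 and 3 on separate lines
printf "%s\n" {1..3}

# Plain characters after the specifier are appended to each argument
printf "%saaa\n" {1..3}

# Plain characters before the specifier are prepended to each argument
printf "aaa%s\n" {1..3}
```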

Now for a trick. In a format specification, a dot (.) with no number after it sets the precision to zero, which for a string argument means "print at most zero characters". The arguments aren't printed at all, but the rest of the format string is, once per argument. This is a way to print "aaa" three times, once per line:

cols2
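As a rough sketch of the trick just described (the command in the screenshot may have been written differently), the second line below shows the same precision mechanism with a non-zero value, which is my own illustration rather than something from the post:

```shell
# Zero precision: each argument is consumed but contributes no characters,
# so only "aaa" and the newline are printed, once per argument
printf "aaa\n%.s" {1..3}

# For comparison, a precision of 2 keeps the first two characters: prints "ab"
printf "%.2s\n" abcdef
```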

Next I'll print 31 hyphens in a row, without a newline. (Bear with me, there's a reason.) To do this I put the hyphen after the format specification. If the format string begins with a hyphen, printf tries to read it as an option and I get an error message:

cols3
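A minimal sketch of the hyphen repeat follows. The second command is not from the post: it assumes you would rather keep the hyphen first, and uses "--" to tell the BASH printf builtin that option parsing is finished. The echo commands are only there to end each line.

```shell
# Hyphen after the specifier: 31 hyphens, no trailing newline
printf "%.s-" {1..31}
echo

# Hyphen first fails without "--", because "-%.s" is read as an option.
# With "--" marking the end of options, the same format works:
printf -- "-%.s" {1..31}
echo
```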

Finally, I'll put that 31-hyphen repeat command inside another repeat command, bracketing the string of hyphens with "36" and "end", and printing that 15 times as the file "demo36":

printf "36$(printf "%.s-" {1..31})end\n%.s" {1..15} > demo36

cols4

Each line in "demo36" has exactly 36 characters; the file will be used in the next section of this blog post.
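One way to confirm the line lengths, sketched here with awk, sort and uniq (tools the post doesn't use for this, so treat the check as my own addition). The file is rebuilt first so the snippet stands alone.

```shell
# "36" + 31 hyphens + "end" = 36 characters per line, repeated 15 times
printf "36$(printf "%.s-" {1..31})end\n%.s" {1..15} > demo36

# Tally lines by length; a single row of output means every line
# has the same number of characters
awk '{ print length($0) }' demo36 | sort -n | uniq -c
```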


Columnating and numbering a header. I often need to get a catalog of field numbers from the data tables I audit. In other words, I need a list of field names from the header line, numbered in serial order. That's easy enough to do: just use tr to convert the field separator in the header line to a newline, then number the resulting list with nl.
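The tr-then-nl step might look like this for a comma-separated file. The file name "sample.csv" and its contents are hypothetical, made up only to give the pipeline something to read.

```shell
# Hypothetical three-field CSV, for illustration only
printf 'id,type,language\n1,specimen,en\n' > sample.csv

# Take the header line, put each field name on its own line, number the list
head -n 1 sample.csv | tr ',' '\n' | nl
```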

But some of those tables have 100-200 fields, and eyeballing that list isn't easy. The scanning can be made easier if I arrange the numbered list in two columns on my terminal screen, and I can do the columnating with either pr or column. Each has an advantage and a disadvantage. To demonstrate, I'll look at the file "tablehead", which represents the header of a 29-field CSV. Field 9 is blank:

id,type,language,license,rightsHolder,accessRights,institutionID,collectionID,,collectionCode,datasetName,ownerInstitutionCode,basisOfRecord,informationWithheld,occurrenceID,catalogNumber,recordedBy,georeferencedBy,georeferencedDate,georeferenceProtocol,georeferenceVerificationStatus,individualCount,sex,lifeStage,preparations,disposition,otherCatalogNumbers,previousIdentifications,eventDate

First I'll try pr. I use the options -t to avoid printing a page header and footer, -2 to specify the number of columns and -n to number each line (screenshot reduced):

tr "," "\n" < tablehead | pr -t -2 -n

cols5

Hmmm. "georeferenceVerificationStatus" has been truncated, even though my Gnome Terminal width is 80 characters, which should be enough to display the full string. The problem is that pr defaults to a 72-character page width when building columns, whatever the width of the terminal. If I columnate "demo36" where every line has exactly 36 characters, this is the result:

cols6

pr needs 36 characters for each of the two columns plus a single space between them, 73 characters in all. With its 72-character default it runs out of width and truncates every 36-character line by one character, even the unpaired line 8. The workaround is to specify a nominal page width with the -w option. Just boosting from 72 to 73 characters fixes the truncation problem:

cols7
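The two pr runs can be sketched as below, assuming the same -t and -2 options used earlier and no line numbering. The head command just keeps the output short; it isn't part of the post's commands.

```shell
# Rebuild the 15-line, 36-characters-per-line test file
printf "36$(printf "%.s-" {1..31})end\n%.s" {1..15} > demo36

# Default page width of 72: each column gets 35 characters,
# so the final "d" of "end" is cut off
pr -t -2 demo36 | head -n 2

# Page width of 73: room for 36 + 1 + 36 characters, so nothing is truncated
pr -t -2 -w 73 demo36 | head -n 2
```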

That's the pr disadvantage. An alternative is to number the lines with nl, then columnate with column, like this:

cols8

Looks good and no truncation, but...whoops! The blank ninth field has disappeared! I can make it return with the nl option -ba to number any blank lines:

cols9
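The difference between the two nl invocations is easier to see on a short list. This sketch uses a hypothetical four-field header with a blank third field, and stops before the column step so that only nl's behavior is on show.

```shell
# Hypothetical header with an empty third field
header='id,type,,sex'

# Default nl: the blank line gets no number, so "sex" is numbered 3
printf '%s\n' "$header" | tr ',' '\n' | nl

# nl -ba: every line is numbered, so the blank field is 3 and "sex" is 4
printf '%s\n' "$header" | tr ',' '\n' | nl -ba
```

Either numbered list can then be piped to column as in the commands above.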

But column has an issue, too. Here's "demo36" with 80-character terminal width:

cols10

and with a 79-character width:

cols11

column won't columnate a list unless it can fit the columns plus a "tab" (number of spaces determined by terminal program) as a spacer. It can just do that with two 36-character columns in 80-character width in my Gnome Terminal, but column runs out of room with 79-character width.
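The arithmetic behind that behavior can be sketched as follows. It rests on an assumption of mine, not on anything stated in the post: that column pads each entry out to the next multiple of 8 characters. That assumption is consistent with the 80 versus 79 result above, but I haven't checked it against column's documentation.

```shell
# Assumption: entries are padded to the next multiple of 8 characters
tab=8
longest=36
padded=$(( (longest / tab + 1) * tab ))   # 36 rounds up to 40

# How many padded columns fit in each terminal width?
for width in 80 79; do
    echo "width $width: $(( width / padded )) column(s)"
done
```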

I've decided to go with pr and put up with some truncation. The function I now use to list and number the fields in a tab-separated table, called "fields" in A Data Cleaner's Cookbook, is:

fields() { head -n 1 "$1" | tr '\t' '\n' | pr -t -2 -n; }
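A quick way to try the function, using a hypothetical tab-separated file "sample.tsv" created just for the demonstration:

```shell
fields() { head -n 1 "$1" | tr '\t' '\n' | pr -t -2 -n; }

# Hypothetical four-field TSV, for illustration only
printf 'id\ttype\tlanguage\tlicense\n1\tspecimen\ten\tCC0\n' > sample.tsv

# List and number the field names in two columns
fields sample.tsv
```

If long field names get truncated, adding a -w option to the pr call inside the function is one possible adjustment.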


Last update: 2020-05-27
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License