


Version 2 of A Data Cleaner's Cookbook is now online. It has more topics and more command-line recipes than version 1 (from 2016) and I think it's better organised.

Getting around a subshell problem

In a blog post last November I wrote about the "toptail" function, which prints the first and last 10 results of a field tally with a blank line in between. The command I used was:

tally [file] [field number] | (head; echo; tail)

The subshell command (head; echo; tail) worked fine, except when it didn't. Sometimes I got the head result and a blank line, but no tail result. Tinkering with the seq command showed that the key to the problem was the size of the list. On my system, a list of numbers from 1 to 1859 was only headed, while 1 to 1869 was both headed and tailed:

[Image: subs1]

In between, the numbers reported by tail increased one by one, backwards:

[Image: subs2]
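The experiment can be repeated with something like the sketch below. The function name headtail_count is made up for this illustration; it counts how many lines come out of the (head; echo; tail) subshell for a list of a given length. The counts around the threshold depend on how head reads from the pipe on a particular system, so no expected numbers are shown for them.

```shell
# Hypothetical helper (not from the original post): count the lines printed
# by the subshell for a list of numbers from 1 to $1.
headtail_count() {
    seq 1 "$1" | (head; echo; tail) | wc -l
}

# A complete result has 21 lines: 10 from head, 1 blank, 10 from tail.
# Fewer than 21 means tail found little or nothing left to read.
for n in 1859 1864 1869 5000; do
    printf '%s\t%s\n' "$n" "$(headtail_count "$n")"
done
```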

After consulting the Googlemind for ideas, I suspected that the problem might be the size of a read buffer inside the subshell. As far as I can tell, head reads its input a block at a time rather than line by line, and it can't put back into a pipe whatever it has read beyond the tenth line. With a short list, head's first read takes the whole list and leaves nothing for tail to process. Once the list is bigger than one block, the part beyond that block is still in the pipe for tail.
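One way to probe the buffer idea is to count the bytes in the lists on either side of the threshold. The sketch below assumes a read block of 8192 bytes, which is a common stdio buffer size on Linux but is an assumption here, not something confirmed for my system. The helper name list_bytes is made up for the example.

```shell
# Assumed block size in bytes (an assumption; check your own system).
block=8192

# Hypothetical helper: size in bytes of the list 1..$1, newlines included.
list_bytes() {
    seq 1 "$1" | wc -c
}

for n in 1859 1860 1869; do
    bytes=$(list_bytes "$n")
    # Anything beyond the first block would still be in the pipe for tail.
    left=$(( bytes > block ? bytes - block : 0 ))
    printf '1 to %s: %s bytes, %s beyond the first block\n' "$n" "$bytes" "$left"
done
```

The list 1 to 1859 comes to 8188 bytes, just under the assumed block, and 1 to 1860 comes to 8193 bytes, just over it, which would fit the threshold described above.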

If that's true, then a list with more bytes per item should escape the buffer problem with fewer items in the list, which is what I found:

[Image: subs3]
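The effect of item length can be estimated without running head at all. This sketch again assumes an 8192-byte block (an assumption, not a measured value), and first_line_past_block is a made-up helper that reports the first line ending beyond that block.

```shell
block=8192   # assumed read block size in bytes; not confirmed for any given system

# Hypothetical helper: read a list on stdin and report the number of the
# first line that ends beyond the assumed block.
first_line_past_block() {
    awk -v block="$block" '
        { total += length($0) + 1 }          # +1 for the newline
        total > block { print NR; exit }
    '
}

seq 1 5000 | first_line_past_block                  # short items
seq -f 'item-%06g' 1 5000 | first_line_past_block   # 12 bytes per line
```

With plain numbers the block is passed at line 1860, but with 12-byte items it is passed at line 683, so a list of longer items needs far fewer entries before tail has something to read.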

Whether or not I'd understood the problem correctly, I needed a workaround. The one I've chosen is based on an uncommon use of sed.

You can emulate head with sed -n '1,10p', but sed then carries on reading to the end of the input, which is just what I don't want to happen in the subshell. I could use sed '10q', which prints the first 10 lines and then quits, but it still reads stdin in buffered blocks and can take more than 10 lines from the pipe. However, the -u option in GNU sed means (as I understand it) that sed loads only minimal amounts of input, so it doesn't read beyond the line it's processing. This worked nicely:

[Image: subs4]
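A quick check of the sed version on a list short enough to trip up the head version might look like this. It assumes GNU sed; other sed implementations may treat -u differently.

```shell
# Assumes GNU sed, where -u (--unbuffered) loads minimal amounts of input.
# sed prints lines 1-10 and quits, leaving lines 11-100 in the pipe for tail.
seq 1 100 | (sed -u '10q'; echo; tail)
```

With GNU sed this should print 1 to 10, a blank line, and then 91 to 100.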

So my revised "toptail" function is:

toptail() { tally "$1" "$2" | (sed -u '10q'; echo; tail); }
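To try the function without the real "tally", something like the following sketch could be used. The tally defined here is only a stand-in written for this example (the real function isn't shown in this post), and the test data are made up.

```shell
# Stand-in for the tally function, which is not shown in this post.
# ASSUMPTION: it tallies one field of a tab-separated file, commonest first.
tally() { cut -f"$2" "$1" | sort | uniq -c | sort -nr; }

# The revised function from the post (needs GNU sed for -u).
toptail() { tally "$1" "$2" | (sed -u '10q'; echo; tail); }

# Made-up test data: value "v1" appears once, "v2" twice ... "v30" 30 times.
demo=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 30; i++) for (j = 1; j <= i; j++) print "x\tv" i }' > "$demo"

# Prints the 10 commonest values, a blank line, then the 10 rarest.
toptail "$demo" 2
# The temporary file can be deleted afterwards with: rm "$demo"
```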

Email me if you know more about this subshell issue, or can think of a better workaround! My system is running BASH 5.0.3 and coreutils 8.30.


Last update: 2020-01-15
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License