

Extract successive pairs from a list, and rapidly grow a list

For a script I'm working on I needed to process a list so that line 1 was paired with line 2 (tab-separated), line 2 with line 3, line 3 with line 4, and so on, in their original order. As an example, the following file ("list") contains ten 5-letter words:

scaly
rearm
flesh
inner
gulch
recap
empty
thick
mince
booth

And what I wanted was this:

scaly	rearm
rearm	flesh
flesh	inner
inner	gulch
gulch	recap
recap	empty
empty	thick
thick	mince
mince	booth

As usual in command-line work, there are different ways to accomplish this pairing. Two I looked at are:

paste list <(tail -n +2 list) | sed '$d'
 
awk 'NR==1 {x=$0; next} {print x,$0; x=$0}' OFS="\t" list

The paste command joins each line of the original list to the corresponding line of the same list with its first entry removed, with a tab between them. Because the second input is one line shorter, the last line of the output is "booth [tab] (blank)", which I delete with sed.
 
The AWK command stores the first line in the variable "x" and then moves to the next line. Every later line is processed by the second part of the command, which prints the preceding line (as stored in "x") and the current line, then resets "x" to contain the current line. The comma in the print statement inserts AWK's output field separator (OFS), which is a space by default, so OFS="\t" changes that separator from a space to a tab.
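As a quick check that the two commands really are equivalent, here is a minimal sketch that rebuilds the example list and compares the two outputs byte for byte. The scratch directory "pairs_demo" and the output file names "paste_out" and "awk_out" are my own placeholders, and the script assumes bash (for the process substitution).

```shell
#!/usr/bin/env bash
# Work in a scratch directory (hypothetical name) so nothing else is overwritten
dir=pairs_demo
mkdir -p "$dir"

# Rebuild the 10-word example list, one word per line
printf '%s\n' scaly rearm flesh inner gulch recap empty thick mince booth > "$dir/list"

# Method 1: paste the list against a copy shifted up by one line,
# then delete the final "booth [tab] (blank)" line
paste "$dir/list" <(tail -n +2 "$dir/list") | sed '$d' > "$dir/paste_out"

# Method 2: keep the previous line in x and print it with the current line;
# the comma in print is replaced by OFS, set here to a tab
awk 'NR==1 {x=$0; next} {print x,$0; x=$0}' OFS="\t" "$dir/list" > "$dir/awk_out"

# cmp prints nothing and returns 0 when the files are byte-identical
cmp "$dir/paste_out" "$dir/awk_out" && echo "both methods give the same output"
```

Comparing with cmp rather than by eye also confirms that both outputs use a tab, not a space, as the separator.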


I was curious to see if there was a significant speed difference, because the first command uses three separate utilities (paste, tail and sed), while the second one only uses AWK. My usual speed-testing method is to time how long a command takes to process a very long list. The fastest way I know to repeat "list" into a huge list is to use the yes command, but IMPORTANT! with a very short timeout to avoid building too large a file!

In this case I built "longlist" with almost 80 million lines (ca 450 MB) in 1 second:

timeout 1s yes "$(<list)" >> longlist
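The number of lines written in one second depends on the machine, so a test file built this way isn't reproducible. One possible variant (not what I used above) is to cap the number of lines instead of the time, by piping yes into head. In this sketch the scratch directory "pairs_demo", the file name "longlist_capped" and the 1,000,000-line cap are all arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Scratch directory and example list (hypothetical location)
dir=pairs_demo
mkdir -p "$dir"
printf '%s\n' scaly rearm flesh inner gulch recap empty thick mince booth > "$dir/list"

# yes repeats the whole 10-line block forever;
# head stops reading after exactly 1,000,000 lines, which ends the pipeline
yes "$(<"$dir/list")" | head -n 1000000 > "$dir/longlist_capped"

# Report the line count of the result
wc -l < "$dir/longlist_capped"
```

Because the cap is a multiple of 10, the file ends on a complete copy of the list. Note also that this uses > rather than >>, so rerunning it replaces the file instead of appending to it.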


Now to time the two processing methods with "longlist":
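A comparison like this can be sketched with the bash time keyword, sending the output to /dev/null so that only processing time is measured. To keep the sketch self-contained it builds a smaller stand-in for "longlist" (2,000,000 lines, an arbitrary size) in a scratch directory "pairs_demo"; both of those are my assumptions, and the timings will differ from machine to machine.

```shell
#!/usr/bin/env bash
# Build a stand-in for "longlist" (size and location are illustrative only)
dir=pairs_demo
mkdir -p "$dir"
printf '%s\n' scaly rearm flesh inner gulch recap empty thick mince booth > "$dir/list"
yes "$(<"$dir/list")" | head -n 2000000 > "$dir/longlist"

# Time the paste/tail/sed pipeline; the parentheses make time cover the whole pipeline
time (paste "$dir/longlist" <(tail -n +2 "$dir/longlist") | sed '$d' > /dev/null)

# Time the single AWK command on the same file
time awk 'NR==1 {x=$0; next} {print x,$0; x=$0}' OFS="\t" "$dir/longlist" > /dev/null
```

The "real" figure reported by time is the elapsed wall-clock time. Running each command more than once helps to even out the effect of file caching.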


OK, the paste/tail/sed command is much faster than the AWK one, although that won't make a significant difference for short lists in my script.


Last update: 2024-05-03
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License