For a full list of BASHing data blog posts, see the index page.


A surprising AWK trick

The AWK-tagged section on Stack Overflow is a good place to find AWKish answers to data-processing problems. This week something unusual appeared in a SO reply: an unexpected AWK behaviour that might save some work. You can read the original thread here. In this post I explain the background and the surprise.

The question included this file:

1
2
PAT1
3
4
PAT2
5
6
PAT2
7
PAT2
8
9
PAT2
10

and the OP wrote "I would like to print the lines between PAT1 and the 3rd occurrence of PAT2".

The customary AWK way to do this is to use a flag. A flag in AWK is a variable with two values: 0, meaning "flag off", and 1, meaning "flag on". A command might be:

awk '/PAT1/ {f=1} f {print} /PAT2/ {c++} c==3 {f=0}' file

[Screenshot: the command above and its output]
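To try this at home, here's a minimal sketch that rebuilds the sample file and runs the flag command on it. The temporary file and the "out" variable are scaffolding of mine, not something from the SO thread.

```shell
# Rebuild the sample file from the question in a temporary location
# (the mktemp path stands in for "file" in the post).
file=$(mktemp)
printf '%s\n' 1 2 PAT1 3 4 PAT2 5 6 PAT2 7 PAT2 8 9 PAT2 10 > "$file"

# The flag-based command from the post, with its output captured.
out=$(awk '/PAT1/ {f=1} f {print} /PAT2/ {c++} c==3 {f=0}' "$file")
printf '%s\n' "$out"
# Prints, one item per line: PAT1 3 4 PAT2 5 6 PAT2 7 PAT2

rm -f "$file"
```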

The command nicely shows off the condition/action patterns that control what AWK does, where the actions are the bits in curly braces. There are four such condition/action elements in this command, and their order is important. AWK runs through all four when processing each line of "file".

The first two lines of "file" ("1" and "2") don't give AWK anything to do. Line 3 is "PAT1", which AWK is looking for with the condition "/PAT1/". Having found it, the action is to set the flag "f" to "1" (= "on"). The next condition in the command is "f", meaning "is the flag on?", and if it is (which it now is on this 3rd line), the action is to print the line. There's nothing more for AWK to do here, because line 3 doesn't have "PAT2" and the variable "c" hasn't been set yet.

Lines 4 and 5 ("3" and "4") are printed because they satisfy the condition "flag (still) on". Same with line 6, but the "PAT2" in this line is a condition AWK has been looking for. Having found "PAT2", AWK runs "c++", which makes "c" a counter that goes up by 1 every time "PAT2" appears. An AWK variable that hasn't been set yet counts as 0 in arithmetic, so this first "c++" takes "c" from 0 to 1. Line 6 is printed because "f" is still on.

Lines 7, 8 and 9 ("5", "6", "PAT2") are printed, but when the "PAT2" in line 9 is found, the variable "c" is silently incremented from 1 to 2.

Lines 10 and 11 ("7" and "PAT2") are printed, but with the third "PAT2" in line 11, a new condition/action comes into force. The variable "c" is now 3, and when that happens AWK sets "f" to "0", turning off the flag.

With the flag off, lines 12-15 don't get printed.
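The walkthrough can be checked with a tracing variant of the command. This is only a sketch: "shown" is a hypothetical helper variable, and the extra printf element reports the state of "f" and "c" after all four original elements have run on each line.

```shell
file=$(mktemp)
printf '%s\n' 1 2 PAT1 3 4 PAT2 5 6 PAT2 7 PAT2 8 9 PAT2 10 > "$file"

# Same conditions as the flag command, but instead of printing the line
# itself, note whether it would have been printed, then report the line
# number, the line, and the values of f and c once the line is processed.
trace=$(awk '
  /PAT1/ {f=1}
  {shown = f ? "printed" : "skipped"}
  /PAT2/ {c++}
  c==3   {f=0}
  {printf "%2d %-4s %s f=%d c=%d\n", NR, $0, shown, f, c}
' "$file")
printf '%s\n' "$trace"

rm -f "$file"
```

The entry for line 11 reads "11 PAT2 printed f=0 c=3": the line is printed while the flag is still on, and only then does the third "PAT2" push "c" to 3 and switch the flag off.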

Because the default action of AWK is to print the line that matches a condition, the "print" action after the "f" condition isn't actually needed:

awk '/PAT1/ {f=1} f; /PAT2/ {c++} c==3 {f=0}' file

[Screenshot: the shortened command and its output]

The semicolon after the "f" ends that condition. With no action in curly braces following it, AWK falls back on its default action and prints the line, and the semicolon also separates the "f" condition from the "/PAT2/" condition. Please note that "f" and "c" are arbitrary choices; "garden" and "hippo" would work just as well:

[Screenshot: the command with "garden" and "hippo" as the variable names]
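Here's a quick sketch comparing the two spellings, assuming the same sample file as before. The shell variables "short" and "renamed" are my own names for the two outputs.

```shell
file=$(mktemp)
printf '%s\n' 1 2 PAT1 3 4 PAT2 5 6 PAT2 7 PAT2 8 9 PAT2 10 > "$file"

# Default print action: a bare "f" condition followed by a semicolon.
short=$(awk '/PAT1/ {f=1} f; /PAT2/ {c++} c==3 {f=0}' "$file")

# Same logic with different (arbitrary) variable names.
renamed=$(awk '/PAT1/ {garden=1} garden; /PAT2/ {hippo++} hippo==3 {garden=0}' "$file")

# The two commands should agree line for line.
[ "$short" = "$renamed" ] && echo "same output"
printf '%s\n' "$short"

rm -f "$file"
```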

A second simplification I can make in the command is to fold the condition/action "{c++} c==3" into the condition "++c==3". Instead of asking 'Is "c" equal to 3 yet?', I ask 'Is the value of the expression "++c" equal to 3 yet?' It's another AWK shorthand; simply asking the question means that "c" gets incremented. (See below for notes on "c++" and "++c".)

[Screenshot: the command using "++c==3"]
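One way the folded command might be written is sketched below. The exact wording in the original screenshot isn't recoverable, so treat the "/PAT2/ && ++c==3" form as my assumption. Because "&&" short-circuits, "++c" is only evaluated on lines that contain "PAT2".

```shell
file=$(mktemp)
printf '%s\n' 1 2 PAT1 3 4 PAT2 5 6 PAT2 7 PAT2 8 9 PAT2 10 > "$file"

# Counting and testing in one condition: c goes up only on PAT2 lines,
# and the flag is switched off when the incremented value reaches 3.
folded=$(awk '/PAT1/ {f=1} f; /PAT2/ && ++c==3 {f=0}' "$file")
printf '%s\n' "$folded"
# Prints, one item per line: PAT1 3 4 PAT2 5 6 PAT2 7 PAT2

rm -f "$file"
```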

So what's the unexpected AWK behaviour? It's based on AWK's range syntax. The construction "/start/,/end/" will print every line from a line matching "start" through the next line matching "end" (and will do so again if "start" turns up later in the file), as shown here with "file":

[Screenshot: the range syntax applied to "file"]
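A minimal sketch of the plain range syntax follows. The first command uses the sample file; the second uses a made-up nine-line input (my own, not from the thread) to show that a range can open more than once.

```shell
file=$(mktemp)
printf '%s\n' 1 2 PAT1 3 4 PAT2 5 6 PAT2 7 PAT2 8 9 PAT2 10 > "$file"

# Plain range: from PAT1 to the next PAT2, inclusive.
first=$(awk '/PAT1/,/PAT2/' "$file")
printf '%s\n' "$first"
# Prints, one item per line: PAT1 3 4 PAT2

# Hypothetical input in which the start pattern appears twice.
again=$(printf '%s\n' a start b end c start d end e | awk '/start/,/end/')
printf '%s\n' "$again"
# Prints, one item per line: start b end start d end

rm -f "$file"
```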

And here's the surprising command — with no flag!

awk '/PAT1/,/PAT2/ && ++c==3' file

[Screenshot: the command above and its output]

AWK is reading this command as if it were

awk '/PAT1/,(/PAT2/ && ++c==3)' file

or in more or less plain English, print the lines starting from the first appearance of "PAT1" to the line where "PAT2" appears and (&&) where "++c", which has been incrementing since "PAT2" first appeared, has a value of 3.

The reason for this behaviour is that the comma in the range syntax "has the lowest precedence of all the operators (i.e., it is evaluated last)" (quote from the GNU AWK manual), and the "&&" operator is evaluated before the range is defined.
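The equivalence of the two spellings can be checked directly. This sketch assumes the same sample file; "implicit" and "explicit" are just my names for the two results.

```shell
file=$(mktemp)
printf '%s\n' 1 2 PAT1 3 4 PAT2 5 6 PAT2 7 PAT2 8 9 PAT2 10 > "$file"

# Range whose end condition is a compound expression, as in the SO reply.
implicit=$(awk '/PAT1/,/PAT2/ && ++c==3' "$file")

# The same range with the end condition parenthesised explicitly.
explicit=$(awk '/PAT1/,(/PAT2/ && ++c==3)' "$file")

[ "$implicit" = "$explicit" ] && echo "same output"
printf '%s\n' "$implicit"
# Prints, one item per line: PAT1 3 4 PAT2 5 6 PAT2 7 PAT2

rm -f "$file"
```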

Here's another example, using the file "list":

Chew a cherry
This mango is yummy
Can I have a mango?
apple strudel
apple and banana
banana and peach
half a pear
pears for dessert
Want a pear?

I want the lines from the second appearance of "mango" to the second appearance of "pear". The shorthand command (with no flags) is

awk '/mango/ && ++a==2 , /pear/ && ++b==2' list

[Screenshot: the command above and its output]
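Here's a reproducible sketch of the "list" example, with the file rebuilt in a temporary location (the "picked" variable is mine). One thing to keep in mind, as far as I understand range patterns: the end condition is only tested while the range is open, so "++b" counts appearances of "pear" from the starting line onwards, not from the top of the file. In "list" that makes no difference, because no "pear" comes before the second "mango".

```shell
list=$(mktemp)
cat > "$list" <<'EOF'
Chew a cherry
This mango is yummy
Can I have a mango?
apple strudel
apple and banana
banana and peach
half a pear
pears for dessert
Want a pear?
EOF

# Start at the 2nd line containing "mango", stop at the 2nd line
# containing "pear" that is seen once the range has opened.
picked=$(awk '/mango/ && ++a==2 , /pear/ && ++b==2' "$list")
printf '%s\n' "$picked"
# Prints the six lines from "Can I have a mango?" to "pears for dessert"

rm -f "$list"
```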

Update. For more on flags in AWK, like how to print from line A to line B but exclude line A, or line B, or both, see this BASHing data post.


c++ and ++c

These two expressions are sometimes confused. In AWK, "c++" is called a post-incremented expression. It adds 1 to "c" each time AWK evaluates it, for example whenever some condition is met. The pre-incremented expression "++c" does exactly the same thing to "c": 1 is added each time the expression is evaluated.

The difference is in the values of the expressions. Here's an example that makes this clear (I hope!). The file "A-list" consists of four lines with the letter "A" in each. I'll ask AWK to increment the count of "A"s on each line and print (for each line) first the value of the incrementing expression, then the value of the counting variable "c":

[Screenshot: both incrementing expressions applied to "A-list"]
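A sketch of the comparison, under a couple of assumptions: "A-list" is rebuilt as four lines each holding a single "A", and the value of the incrementing expression is first stored in a helper variable "v" (my addition) so that the order of evaluation inside "print" can't affect the result.

```shell
alist=$(mktemp)
printf '%s\n' A A A A > "$alist"

# Post-increment: v gets the old value of c, then c goes up by 1.
post=$(awk '/A/ {v = c++; print v, c}' "$alist")

# Pre-increment: c goes up by 1 first, and v gets the new value.
pre=$(awk '/A/ {v = ++c; print v, c}' "$alist")

printf 'c++ then c:\n%s\n' "$post"
# Prints "0 1", "1 2", "2 3", "3 4" on successive lines.
printf '++c then c:\n%s\n' "$pre"
# Prints "1 1", "2 2", "3 3", "4 4" on successive lines.

rm -f "$alist"
```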

As you can see, "c" increases by 1, line by line, in both cases. So do the incrementing expressions, but "++c" keeps up with "c" (it's "pre-incremented") while "c++" still has its previous value. It's for this reason that the value of "++c" is tested ("++c==3") in the commands above.
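To see what goes wrong with the other form, here's a sketch that swaps "c++" into the range command. This variant is my own illustration, not something from the SO thread. Because "c++" yields the value "c" had before the increment, the test "c++==3" only succeeds on the fourth "PAT2".

```shell
file=$(mktemp)
printf '%s\n' 1 2 PAT1 3 4 PAT2 5 6 PAT2 7 PAT2 8 9 PAT2 10 > "$file"

# Post-increment in the end condition: the expression values on the
# successive PAT2 lines are 0, 1, 2, 3, so the range closes one
# PAT2 too late.
late=$(awk '/PAT1/,/PAT2/ && c++==3' "$file")
printf '%s\n' "$late"
# Prints, one item per line: PAT1 3 4 PAT2 5 6 PAT2 7 PAT2 8 9 PAT2

rm -f "$file"
```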

Note also the way that the incrementing is folded into the print command in the screenshot above, rather than having to be a separately specified operation. That's another case of shorthand or "idiomatic" AWK.


Last update: 2021-12-24
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License