

Post- and pre-incrementing (var++ and ++var) with AWK

Incrementing (++) adds one to an AWK variable, and in many situations it doesn't matter whether the "++" precedes the variable or follows it.

For example, the very small CSV "demo1" is just a,b,c. A "for" loop over the 3 comma-separated fields in "demo1" gives the same result with post- and pre-incrementing:

[Screenshot: postpre1]
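The screenshot isn't reproduced here, so this is a minimal sketch of commands that would show the same thing, not necessarily the exact ones used in the post. It assumes "demo1" is rebuilt with printf and that the loop simply prints each field on its own line:

```shell
# Recreate the one-line CSV described above
printf 'a,b,c\n' > demo1

# Post-incrementing the loop variable: prints a, b and c on separate lines
awk -F, '{for (i=1; i<=NF; i++) print $i}' demo1

# Pre-incrementing the loop variable: same three lines of output
awk -F, '{for (i=1; i<=NF; ++i) print $i}' demo1
```

The increment in a "for" loop is only used for its side effect on "i", so which form is used makes no difference to the output.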

For counting purposes there's likewise no difference. "demo2" (below) has "X" on 3 of its lines:

a
aX
a
a
aX
aX
a
a

If I use "c" as a counting variable to register when a line has "X" in it, the final count is the same whether "c" was post- or pre-incremented:

[Screenshot: postpre2]
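As a sketch of what such a count could look like (the pattern /X/ and the END block are my assumptions about the form of the command, since the screenshot isn't available):

```shell
# Recreate the 8-line "demo2" listed above
printf 'a\naX\na\na\naX\naX\na\na\n' > demo2

# Count lines containing "X" with post-incrementing: prints 3
awk '/X/ {c++} END {print c}' demo2

# The same count with pre-incrementing: also prints 3
awk '/X/ {++c} END {print c}' demo2
```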

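If the pattern being looked for is missing, {print c} prints only an empty line, because "c" was never set and an unset AWK variable is an empty string when printed. To get a count of "0", prefix the counting variable with the unary operator "+", which forces a numeric value: {print +c}. Many thanks to Sundeep Agarwal for the tip on this "gotcha"!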
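A small illustration of the "gotcha", assuming a search for "Z" (a character I've picked because it doesn't occur anywhere in "demo2"):

```shell
# Recreate "demo2"
printf 'a\naX\na\na\naX\naX\na\na\n' > demo2

# "Z" never matches, so c is never set: this prints an empty line
awk '/Z/ {c++} END {print c}' demo2

# Unary plus forces c to a number: this prints 0
awk '/Z/ {c++} END {print +c}' demo2
```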

There is, however, a very small processing advantage in using pre-incrementing for the count. When I concatenate "demo2" 160,000 times to build "bigdemo" with 1,280,000 lines and time that AWK command, the "++c" version runs a few percent faster than the "c++" version:

[Screenshot: postpre3]
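To try the comparison yourself, here is one possible sketch. It assumes bash (for the "time" keyword) and builds "bigdemo" with AWK itself rather than by repeated concatenation, which may not be how the file was built for the post. Timings will vary with the machine and the AWK implementation, so none are shown:

```shell
# Recreate "demo2"
printf 'a\naX\na\na\naX\naX\na\na\n' > demo2

# Write 160,000 copies of demo2 (1,280,000 lines) to "bigdemo"
awk '{line[NR]=$0}
     END {for (n=1; n<=160000; n++) for (i=1; i<=NR; i++) print line[i]}' demo2 > bigdemo

# Time the post-increment count, then the pre-increment count;
# both print 480000 (3 matching lines x 160,000 copies)
time awk '/X/ {c++} END {print c}' bigdemo
time awk '/X/ {++c} END {print c}' bigdemo
```

A single run is noisy, so repeating each command several times gives a fairer comparison.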

So if post- and pre-incrementing aren't that different in their effect on a variable (they both just add one), how are they different?

They differ in their values as expressions: the value of "var++" isn't the same as the value of "++var". The post-increment expression "var++" returns the original value of "var" and then adds one to the variable. The pre-increment expression "++var" adds one first and returns the new value, i.e. the original "var" value plus one. To demonstrate I'll process "demo3" (see screenshot below) to count the lines with "a", but this time I'll print current values of "++c" or "c++" as well as "c":

[Screenshot: postpre4]

Notice that "c++" starts off with a value of zero, because at the time AWK finds the first line with "a", the count hasn't yet begun. The expression "++c" adds one to that original zero value. Both expressions, "c++" and "++c", change "c" the same way (by adding one for every find of "a"), but the value of "++c" keeps up with the count, while the value of "c++" is always one behind the count.
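The contents of "demo3" are only visible in the screenshot, so the sketch below uses an assumed 5-line stand-in that fits the description in this post ("a" on lines 1, 3, 4 and 5). I've also stored the value of the increment expression in a helper variable "v" before printing, because AWK doesn't guarantee the order in which the parts of a single print statement are evaluated. Each output line shows the line number, the value of the expression, and the value of "c" afterwards:

```shell
# Hypothetical stand-in for "demo3"
printf 'a\nb\na\na\na\n' > demo3

# Post-increment: v holds the value of "c++", which lags one behind c
awk '/a/ {v = c++; print NR, v, c}' demo3
# 1 0 1
# 3 1 2
# 4 2 3
# 5 3 4

# Pre-increment: v holds the value of "++c", which equals c
awk '/a/ {v = ++c; print NR, v, c}' demo3
# 1 1 1
# 3 2 2
# 4 3 3
# 5 4 4
```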

I haven't found many reasons in my data-processing work to favour one kind of incrementing over another, but in a BASHing data post six years ago I showed how pre-incrementing saves some typing. I'll demonstrate that trick again here by asking AWK to print the line number in "demo3" where the "a" count reaches 3, namely line 4:

[Screenshot: postpre5]

A logical way to do this is to set a counter going for lines with "a" ("c++"), and when the count reaches 3 ("c==3"), print the line number. An elegant method is to make "++c==3" a condition for printing the line number, because "++c" reaches 3 at line 4. The value of "c++", on the other hand, doesn't reach 3 until AWK processes the last line, line 5.
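Here is how the three variants might be written. The exact commands in the screenshot may differ, and "demo3" is again an assumed stand-in with "a" on lines 1, 3, 4 and 5:

```shell
# Hypothetical stand-in for "demo3"
printf 'a\nb\na\na\na\n' > demo3

# Longer form: count first, then test the counter. Prints 4
awk '/a/ {c++; if (c==3) print NR}' demo3

# Shorter form: the value of "++c" is tested directly. Prints 4
awk '/a/ && ++c==3 {print NR}' demo3

# With post-incrementing the tested value lags by one. Prints 5
awk '/a/ && c++==3 {print NR}' demo3
```

Because "&&" short-circuits, the increment in the last two commands only happens on lines that contain "a".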


Last update: 2024-04-26
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License