banner

For a full list of BASHing data blog posts see the index page.  RSS


A quick cross-file comparison with AWK

I really like AWK. It allows me to do simple, effective, ad hoc processing of data files, as this post will demonstrate. If AWK was a football club I'd be an ardent supporter: "Carn the mighty AWK!"

"Carn the mighty Hawks!" is a rallying cry for the Hawthorn Football Club (Australian Rules, Australian Football League), a Melbourne team which for the past 20 years has called Tasmania its second home and which has won the AFL premiership four times in that period.

I had two data tables with a joining field in common. The first table assigned an ISO 8601 date to each unique entry in that field. The second table had an unrelated ISO 8601 date in each record. I needed to check whether any of the second table's dates were before the corresponding date in the first table, because "table2" dates were supposed to come after "table 1" dates.

The real-world tables had dozens of fields and thousands of records. Below are two vastly simplified versions. "table1" has a UID field and a date field. "table2" has its own UID field, the first table's UID field as foreign key, and its own (different) date field.

table1
 
ID1|date1
A01|2021-06-13
A02|2021-03-13
A03|2021-06-10
A04|2021-05-21
A05|2021-04-09
A06|2021-03-17
A07|2021-06-12
A08|2021-03-07
A09|2021-05-11
A10|2021-04-02

table2
 
ID2|ID1|date2
B01|A10|2021-05-27
B02|A09|2021-06-19
B03|A04|2021-05-20
B04|A10|2021-05-02
B05|A01|2021-07-23
B06|A04|2021-07-04
B07|A07|2021-07-30
B08|A07|2021-08-05
B09|A02|2021-04-13
B10|A01|2021-06-27
B11|A01|2021-07-17
B12|A03|2021-08-05
B13|A08|2021-04-05
B14|A05|2021-05-15
B15|A07|2021-07-25
B16|A07|2021-08-01
B17|A07|2021-07-09
B18|A04|2021-06-02
B19|A10|2021-05-05
B20|A10|2021-05-07
B21|A03|2021-06-27
B22|A02|2021-03-09
B23|A01|2021-07-25
B24|A09|2021-06-17
B25|A07|2021-06-08
B26|A09|2021-05-24
B27|A04|2021-06-10
B28|A10|2021-05-04
B29|A02|2021-04-03
B30|A02|2021-04-27

For those cases where the "table2" date is before the "table1" date, I want to print the "table2" UID, the "table1" date and the "table2" date, in that order. I do the job with this ad hoc AWK command:

awk -F"|" 'FNR==NR {a[$1]=$2; next} \
$3 < a[$2] {print $1 FS a[$2] FS $3}' table1 table2

awk1

AWK is told with -F"|" that the field separator in both tables is the pipe character.
 
The command uses the FNR==NR...next trick to work the first action on "table1" and the second action on "table2". The trick is explained in detail in this BASHing data post from 2019.
 
The first action, a[$1]=$2, builds an array "a" in which the index string is the ID1 entry in "table1" and the value string is the corresponding date1.
 
The second action works line by line on "table2". AWK first checks to see if the date2 field (field 3) is "less than" the value string in "a" corresponding to the ID1 in this table's field 2 ($3 < a[$2]). If so, AWK prints ID2, the value string (which is the date from "table1") and the "table2" date, separated by pipes (print $1 FS a[$2] FS $3).
 
The "less than" operation relies on the fact that ISO 8601 dates sort as strings, which coincidentally makes the sorting chronological. For example, "2021-05-20" is sorted by AWK before "2021-05-21", and "2021-07" after "2021-05-21".

In real-world data tables, one or both of the date entries might be missing. Since any ISO 8601 date is "greater than" an empty string in sort order, it's worth my while to add the condition that both dates actually exist. In the "table1", "table2" example the command would now be:

awk -F"|" 'FNR==NR && $2 {a[$1]=$2; next} \
$3 && $3 < a[$2] {print $1 FS a[$2] FS $3}' table1 table2


Last update: 2021-11-10
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License