
Finding changepoints in a list, revisited

Back in 2013 I wrote a short article for Free Software Magazine about using the command line to find changes in a list. The example I used is shown below; it's here called "fileA":

Brown,A,"446 Wivern Hwy",Atn,"9551 2616"
Brown,BD,"49 Burnett St",Rmd,"9551 3613"
Brown,BG,"18 The Passage",Spr,"9551 9048"
Brown,F,"67 Richmond Rd",Atn,"0422 561 389"
Brown,JA,"55 Burnett St",Rmd,"9551 6350"
Brown,JJ,"441 Tennyson Ave",Spr,"9551 3663"
Brown,JS,"61 Richmond Rd",Atn,"9551 6737"
Brown,R,"33 Coalfield Rd",Wav,"9551 3477"
Brown,W,"95 Underwood St",Spr,"0422 113 777"
Browne,C,"4 Ellington Cres",Spr,"9551 3305"
Browne,F,"265 Crown Rd",Wav,"9551 3039"
Browne,SH,"71 Skyline Dr",Und,"0422 840 211"
Browning,B,"108 Market Ave",Wav,"9551 6942"
Browning,CE,"106 Market St",Wav,"9551 8763"
Browning,G,"183 Kent St",Und,"9551 7418"
Browning,GR,"33 Marshall Ave",Und,"0422 565 719"
Browning,H,"24a Archer St",Rmd,"0422 888 470"
Browning,RD,"3a Archer St",Rmd,"9551 4112"
Browning,V,"77 Botany St",Spr,"9551 7485"
Brownley,C,"12 King St",Wav,"9551 7619"
Brownley,E,"314 Litchfield St",Spr,"9551 1624"

As you can see, this list of contact details is sorted alphabetically by person's name. Is there an easy way to find the changepoints where Brown becomes Browne, Browne becomes Browning, and Browning becomes Brownley?

My 2013 solution is embarrassing. I didn't know much about AWK at the time and I used a chain of 6 different commands. A pure-AWK solution is simpler:

awk -F"," '$1 != a && f {print b"\n"$0"\n"} {a=$1; b=$0; f=1}' fileA


The field separator is set to a comma (-F",") and AWK proceeds line by line through the file. On each line it first tests whether the first field (surname) differs from the variable "a" and whether the flag "f" is set. On the first line "a" is still empty, so the surname does differ from it, but "f" hasn't been set yet, so the test fails and nothing is printed. AWK then goes to the second action, which has no condition and therefore runs on every line: it sets "a" to the value of the first field, "b" to the value of the whole line, and the flag "f" to "true" (f=1).
On the second line the first field is equal to "a", so the first action is skipped and the second action is repeated, refreshing "a" and "b". At line 10, the first field isn't equal to "a" and the flag is set, so AWK prints the previous line (stored in "b"), a newline, the current line and another newline, before repeating the second action. The same thing happens at lines 13 and 20.
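
To make the two pattern-action pairs easier to follow, here is a minimal sketch that runs the same command, spread over several lines and commented, on five records copied from fileA. The temporary file and the variable names "sample" and "changes" are my own choices for illustration, not part of the original command.

```shell
# Five records copied from fileA, written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
Brown,R,"33 Coalfield Rd",Wav,"9551 3477"
Brown,W,"95 Underwood St",Spr,"0422 113 777"
Browne,C,"4 Ellington Cres",Spr,"9551 3305"
Browne,F,"265 Crown Rd",Wav,"9551 3039"
Browning,B,"108 Market Ave",Wav,"9551 6942"
EOF

changes=$(awk -F"," '
    # Runs only when the surname differs from the stored one AND the flag is set,
    # which rules out the first line of the file
    $1 != a && f {print b"\n"$0"\n"}
    # Runs on every line: store the surname and the whole line, and set the flag
    {a=$1; b=$0; f=1}
' "$sample")

printf '%s\n' "$changes"
rm -f "$sample"
```

This should print the Brown,W and Browne,C lines as one pair, then a blank line, then the Browne,F and Browning,B lines as a second pair.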

A tweak of the command prints the line-number pairs where the surname changes. Here "b" stores the previous line's number (NR) instead of the line itself:

awk -F"," '$1 != a && f {print b"/"NR} {a=$1; b=NR; f=1}' fileA
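
If you'd also like to see which surnames sit on either side of each changepoint, one possible variation (a sketch, not the command above) prints the old and new surname after the line-number pair. It's run here on five records copied from fileA; the temporary file and the variable names "sample" and "pairs" are assumptions for the example.

```shell
# Five records copied from fileA, written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
Brown,R,"33 Coalfield Rd",Wav,"9551 3477"
Brown,W,"95 Underwood St",Spr,"0422 113 777"
Browne,C,"4 Ellington Cres",Spr,"9551 3305"
Browne,F,"265 Crown Rd",Wav,"9551 3039"
Browning,B,"108 Market Ave",Wav,"9551 6942"
EOF

# "b" holds the previous line number and "a" the previous surname,
# so both are still available when the change is detected
pairs=$(awk -F"," '$1 != a && f {print b"/"NR, a" -> "$1} {a=$1; b=NR; f=1}' "$sample")

printf '%s\n' "$pairs"
rm -f "$sample"
```

On this sample the output is "2/3 Brown -> Browne" followed by "4/5 Browne -> Browning".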


A practical application of changepoint-finding might be in time series, where you're looking for abrupt changes over time. In the tab-separated "fileB", below, there are 2 abrupt shifts in the recorded value:

awk -F"\t" '$2 != a && f {print b"\n"$0"\n"} {a=$2; b=$0; f=1}' fileB
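
As a rough illustration of the command on a time series, the sketch below builds a small tab-separated file and runs the same AWK test on its second field. The dates and values are made up for this example and are not the contents of fileB; the variable names "series" and "shifts" are likewise assumptions.

```shell
# Hypothetical tab-separated series (date, value) with two shifts in the value
series=$(mktemp)
printf '%s\t%s\n' \
    2018-11-01 5.2 \
    2018-11-02 5.2 \
    2018-11-03 5.2 \
    2018-11-04 7.9 \
    2018-11-05 7.9 \
    2018-11-06 3.1 \
    2018-11-07 3.1 > "$series"

# Same logic as before, but comparing the second field (the value)
shifts=$(awk -F"\t" '$2 != a && f {print b"\n"$0"\n"} {a=$2; b=$0; f=1}' "$series")

printf '%s\n' "$shifts"
rm -f "$series"
```

This should print the 2018-11-03 and 2018-11-04 lines as one pair and the 2018-11-05 and 2018-11-06 lines as another. Note that the test "$2 != a" flags every change in value, so it only isolates abrupt shifts when the value is otherwise constant between them.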


Last update: 2018-12-06
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License