


How to find almost-duplicates

A Data Cleaner's Cookbook explains how to find partial duplicates in a data table using an AWK array.

Here's an example using the tab-separated table "demo" (copy/paste-able as TSV):

Fld1	Fld2	Fld3	Fld4	Fld5	Fld6	Fld7	Fld8
001	7b03	020d	71b7	43c4	8ffd	f935	2b102e2d
002	521a	1da1	f9eb	4268	9fa8	fc73	570a31b8
003	e6c3	0e9b	dc9f	448b	b1c4	7705	ca772ab5
004	36cf	fd59	0c62	4eb6	82d1	e300	76ecedbd
005	15c5	7874	33dc	4b20	b1c4	7a1f	3b8465b0
006	b3fb	5bad	3361	4259	a5b0	3037	0c953333
007	15c5	c3d5	33dc	4b20	b1c4	7a1f	3b8465b0
008	7b03	7686	d264	4c34	b0e4	3646	075af668
009	7ee1	8a53	5cc5	4f57	9cf5	ddc7	3556eee8
010	bd75	3324	21mz	41b0	b1bc	2296	4ab9f2a3
011	15d7	1fb2	7223	4e8f	8f1f	8e6b	76f60cd1
012	c3cc	ef6c	70fb	4a45	9428	f00f	7307e92d
013	9ab4	991c	0bd7	4f3c	badf	ee14	5b5a6d17
014	6ad5	8395	19aa	43c4	9cea	3a3c	90e84150
015	607c	3753	8a69	44bf	b41f	ddb1	eb4a42ff
016	7b03	f067	71b7	43c4	8ffd	f935	2b102e2d
017	05f1	89f3	5067	6712	b1c4	3b52	454c4e35
018	e20d	5346	71a8	4b26	a31d	ab91	4de39049
019	15c5	52bd	33dc	4b20	b1c4	7a1f	3b8465b0
020	e917	b879	08dd	4387	b520	814a	8a10717b

Fields 1 and 3 of "demo" contain only unique values. Ignoring those two fields, are there any duplicate records in the table? The Cookbook-style command would be:

awk -F"\t" 'FNR==NR {a[$2,$4,$5,$6,$7,$8]++; next} a[$2,$4,$5,$6,$7,$8]>1' demo demo | sort -t $'\t' -k2 -nk1

[screenshot: output of the command above]

Fields in "demo" are tab-delimited (-F"\t"). The FNR==NR condition tells AWK to do the first action on the first file, "demo", and the second action on the second file, also "demo", as explained here.
 
In the first action, the array "a" is filled using the strung-together fields 2 and 4-8 as index strings (a[$2,$4,$5,$6,$7,$8]), with the count of each unique combination of fields as the value (incremented by ++). After each line's fields have been added to the array (or merely counted again if the combination is already there), AWK moves to the next line (next).
 
In the second action, AWK checks the value string for the array member corresponding to the fields in the current line. If the value is greater than 1 (meaning there's a duplicate), the line is printed, which is AWK's default action.
 
All the almost-duplicate lines found by AWK are passed to sort, which in this case sorts first by the second tab-separated field (field 2), then by the serial number in field 1.
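The mechanics are easiest to see on a toy table (hypothetical data, not "demo"): fields 1 and 3 are unique-entry, and records 001 and 003 agree in the other two fields:

```shell
# Toy tab-separated table: fields 1 and 3 unique, fields 2 and 4
# shared by records 001 and 003.
printf '001\taa\tx1\tbb\n002\tcc\tx2\tdd\n003\taa\tx3\tbb\n004\tcc\tx4\tee\n' > mini

# Pass 1 (FNR==NR) counts each fields-2-and-4 combination;
# pass 2 prints the records whose combination was counted more than once.
awk -F"\t" 'FNR==NR {a[$2,$4]++; next} a[$2,$4]>1' mini mini
# prints records 001 and 003
```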

It works, but there's a convenience issue here. I had to write out all the fields except 1 and 3 for the array index. Wouldn't it be simpler to write something like this?

{a[expression for all fields except $1 and $3]++; next} ...

It would be simpler, but unfortunately there's no such expression in GNU AWK. You can't invert a selection of fields. In a recent data audit the issue loomed a lot bigger for me than it does in "demo": there were two unique-entry fields and 139 others, and I was looking for the "almost-duplicates" in the 139 non-unique-entry fields! (See the end of this post.)
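There is no built-in field-complement, but a loop can fake one. The following sketch (my addition, not from the Cookbook) builds the array index from every field except the numbers held in the "skip" array, shown here on a small hypothetical table:

```shell
# Hypothetical 4-record table; fields 1 and 3 are unique-entry.
printf '001\taa\tx1\tbb\n002\tcc\tx2\tdd\n003\taa\tx3\tbb\n004\tcc\tx4\tee\n' > mini

# Build the index in a loop, skipping the excluded field numbers.
# SUBSEP keeps the fields separated, exactly as a[$2,$4] would.
awk -F"\t" 'BEGIN {skip[1]=skip[3]=1}
    {key=""; for (i=1; i<=NF; i++) if (!(i in skip)) key = key SUBSEP $i}
    FNR==NR {a[key]++; next}
    a[key]>1' mini mini
# prints records 001 and 003
```

Changing which fields are excluded then means editing only the BEGIN block, not a long list of "$N" items.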

It's easy enough to determine whether "almost-duplicates" actually exist in a table using cut, sort and uniq. Here's that process using "demo":

cut -f1,3 --complement demo | sort | uniq -D

[screenshot: output of the command above]

Unfortunately this method doesn't return the whole record, so you don't know where in the table these "almost-duplicates" are lurking.
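A quick illustration with a hypothetical 3-record table (note that "--complement" is a GNU cut extension):

```shell
printf '001\taa\tx1\tbb\n002\tcc\tx2\tdd\n003\taa\tx3\tbb\n' > mini

# Keep everything except fields 1 and 3 (--complement is GNU cut),
# then show all duplicated combinations.
cut -f1,3 --complement mini | sort | uniq -D
# prints "aa<TAB>bb" twice -- the duplicates are found, but the
# serial numbers that would locate them are gone
```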

An interesting workaround is based on replacing the unique-entry field contents with the "zero or more of any character" regex ".*", then getting a list of duplicate records, uniquifying it and searching for those uniquified records in the original table. As with the AWK array method, two passes through the table are required.

In the first pass through "demo", AWK does the replacement of field 1 and field 3 contents with ".*". The result is piped to sort and uniq, with uniq's -d option returning just one example of each of the duplicates:

awk 'BEGIN {FS=OFS="\t"} NR>1 {$1=$3=".*"; print}' demo | sort | uniq -d

[screenshot: output of the command above]

AWK needs to be told that both the input and output field separator is a tab (BEGIN {FS=OFS="\t"}) because each line will be rebuilt in the action part of the command. The action applies to lines after the header (NR>1) and consists of setting the values of fields 1 and 3 to the string ".*", then printing the line.
 
The "-d" option for uniq isn't as familiar to many users as the "-D" option. "-D" prints all duplicate lines (only). "-d" also prints only duplicate lines, but only one example from each group of duplicates.

Now to search for the two records above, which have the regex ".*" in place of their unique field 1 and field 3 entries. I can do that either with grep -f, using a process substitution of the first pass as the pattern file, or with a while loop that reads the process substitution one line at a time and passes each line to AWK for matching in "demo":

grep -f <(awk 'BEGIN {FS=OFS="\t"} NR>1 {$1=$3=".*"; print}' demo | sort | uniq -d) demo | sort -t $'\t' -k2 -nk1
 
while IFS= read -r line; do awk -v FIND="$line" '$0 ~ FIND' demo; done < <(awk 'BEGIN {FS=OFS="\t"} NR>1 {$1=$3=".*"; print}' demo | sort | uniq -d) | sort -t $'\t' -k2 -nk1

[screenshot: output of the commands above]

In the while loop command, each line in the list of duplicates is fed to AWK as the shell variable "$line" and replaced with the AWK variable FIND (-v FIND="$line"). As AWK goes through "demo" line by line, it checks to see if the current line is a regex match for the line from the list of duplicates ($0 ~ FIND), and if it is, the current line is printed. Each of the unique entries in fields 1 and 3 is matched by ".*".
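The matching step can be tried in isolation, using a hypothetical two-record table and a pattern line with ".*" where the unique fields were:

```shell
printf '001\taa\tx1\tbb\n002\tcc\tx2\tdd\n' > mini
pattern=$'.*\taa\t.*\tbb'      # fields 1 and 3 replaced by ".*"

# AWK treats FIND as a dynamic regex; the ".*" parts match the
# unique entries in fields 1 and 3.
awk -v FIND="$pattern" '$0 ~ FIND' mini
# prints the "001" record only
```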

Both these methods work, but they fail spectacularly if the "almost-duplicate" lines contain any characters (other than the inserted ".*") that grep or AWK will interpret as regex metacharacters. There aren't any in "demo", but there were lots in my audit table, and I had to fall back on the array method (see top of page). The array was indexed with

$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,
$21,$22,$23,$24,$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,
$38,$39,$40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,
$55,$56,$57,$58,$59,$60,$61,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,
$72,$73,$74,$75,$76,$77,$78,$79,$80,$81,$83,$84,$85,$86,$87,$88,$89,$90,
$91,$92,$93,$94,$95,$96,$97,$98,$99,$100,$101,$102,$103,$104,$105,
$106,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$118,
$119,$120,$121,$122,$123,$124,$125,$126,$127,$128,$129,$130,$131,
$132,$133,$134,$135,$136,$137,$138,$139

where the excluded unique-entry fields were 1 and 82. AWK processed the audit table (139 fields, 810000+ records) in less than 15 seconds on my desktop PC (quad core i5, 8G RAM).

The index string formula for the array comes from echo "\$$(seq -s ",$" 139)", from which I manually deleted "$1," and "$82,".
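With a smaller N the formula is easy to check:

```shell
# seq -s ",$" joins the numbers 1..N with the separator ",$";
# the escaped "\$" supplies the leading dollar sign.
# (GNU seq; N=6 here instead of 139.)
echo "\$$(seq -s ",$" 6)"
# prints $1,$2,$3,$4,$5,$6
```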


Last update: 2020-07-01
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License