

Duplicate records differing only in unique identifiers

There's a big data table with lots of fields and lots of records. Each record has entries in one or more unique identifier fields. How do you check for records that are exactly the same, apart from those unique identifiers?

I've been tinkering with this problem for years, and you can read my last, fairly clumsy effort in this BASHing data blog post from 2020. Here I present a much-improved solution, which has also gone into A Data Cleaner's Cookbook as an update.

In 2020, the fastest and most reliable method I had for extracting these partial duplicates used an AWK array and two passes through the table. In the first pass, an array "a" is built with the non-unique-identifier field entries as the index string and a running tally of each different combination as the value. In the second pass through the table, AWK looks for records whose index string has a tally greater than one and, by default, prints those records.

The example table from the 2020 blog post was the TSV "demo":

Fld1  Fld2  Fld3   Fld4  Fld5  Fld6  Fld7  Fld8
001   7b03  020d7  1b74  3c48  ffdf  9352  b102e2d
002   521a  1da1f  9eb4  2689  fa8f  c735  70a31b8
003   e6c3  0e9bd  c9f4  48bb  1c47  705c  a772ab5
004   36cf  fd590  c624  eb68  2d1e  3007  6ecedbd
005   15c5  78743  3dc4  b20b  1c47  a1f3  b8465b0
006   b3fb  5bad3  3614  259a  5b03  0370  c953333
007   15c5  c3d53  3dc4  b20b  1c47  a1f3  b8465b0
008   7b03  7686d  2644  c34b  0e43  6460  75af668
009   7ee1  8a535  cc54  f579  cf5d  dc73  556eee8
010   bd75  33242  1mz4  1b0b  1bc2  2964  ab9f2a3
011   15d7  1fb27  2234  e8f8  f1f8  e6b7  6f60cd1
012   c3cc  ef6c7  0fb4  a459  428f  00f7  307e92d
013   9ab4  991c0  bd74  f3cb  adfe  e145  b5a6d17
014   6ad5  83951  9aa4  3c49  cea3  a3c9  0e84150
015   607c  37538  a694  4bfb  41fd  db1e  b4a42ff
016   7b03  f0677  1b74  3c48  ffdf  9352  b102e2d
017   05f1  89f35  0676  712b  1c43  b524  54c4e35
018   e20d  53467  1a84  b26a  31da  b914  de39049
019   15c5  52bd3  3dc4  b20b  1c47  a1f3  b8465b0
020   e917  b8790  8dd4  387b  5208  14a8  a10717b

Fields 1 and 3 of "demo" contain only unique values. Ignoring those two fields, are there any duplicate records in the table? The 2020 command was:

awk -F"\t" 'FNR==NR {a[$2,$4,$5,$6,$7,$8]++; next} \
a[$2,$4,$5,$6,$7,$8]>1' demo demo

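Run on "demo", the command finds the two sets of partial duplicates (records 001 and 016 match on the non-ID fields, as do 005, 007 and 019) and prints the five records in file order:

001   7b03  020d7  1b74  3c48  ffdf  9352  b102e2d
005   15c5  78743  3dc4  b20b  1c47  a1f3  b8465b0
007   15c5  c3d53  3dc4  b20b  1c47  a1f3  b8465b0
016   7b03  f0677  1b74  3c48  ffdf  9352  b102e2d
019   15c5  52bd3  3dc4  b20b  1c47  a1f3  b8465b0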

And the output can be sorted to suit.

The method works fine, but what if there are one or two unique identifier fields and 100+ other fields? It's tedious to build an index string like $2,$4,$5,$6,$7,$8....

In my improved method, I don't have to build that index string at all. Instead, in the first pass through the table I set the unique identifier field entries to the arbitrary value "1", then use the whole modified record ($0) as the array index. In the second pass, I first store the whole, unmodified record in the variable "x", then again set the unique identifier fields to "1". If the array value for the modified record is greater than 1, I print "x":


awk -F"\t" 'FNR==NR {[Set UI field(s) = 1; a[$0]++; next} \
{x=$0; [Set UI field(s) = 1]} a[$0]>1 {print x}' table table

The unique identifier field entries can be set to any arbitrary value. I use "1", but the empty string ("") and the string "hippopotamus" (with quotes) would work just as well. The aim is to make those unique identifiers non-unique.
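
To make that concrete, here's the command filled in for "demo", where the unique IDs are in fields 1 and 3. It prints the same five records found by the 2020 command:

awk -F"\t" 'FNR==NR {$1=$3=1; a[$0]++; next} \
{x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

One detail worth knowing: assigning to a field makes AWK rebuild $0 with the output field separator, which is a space by default. The matching still works because both passes rebuild records the same way, but if a field might itself contain spaces, two different records could in principle rebuild to the same string. Adding -v OFS="\t" to the command keeps the rebuilt records tab-separated and rules out that collision.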

It's a simple change in the code, but it saves a lot of fiddling!


Update. An elegant one-pass variation of this command was suggested by Janis Papanagnou in 2023. The output is sets of partial duplicate records with each set separated by a blank line:

awk -F"\t" '{x=$0; [Set UI field(s) = 1]; y=$0; a[y]=a[y] RS x; b[y]++} END {for (i in a) if (b[i]>1) print a[i]}' table

Filled in for the TSV "demo", with its unique IDs in fields 1 and 3, the command and its output look like this:

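awk -F"\t" '{x=$0; $1=$3=1; y=$0; a[y]=a[y] RS x; b[y]++} \
END {for (i in a) if (b[i]>1) print a[i]}' demo

001   7b03  020d7  1b74  3c48  ffdf  9352  b102e2d
016   7b03  f0677  1b74  3c48  ffdf  9352  b102e2d

005   15c5  78743  3dc4  b20b  1c47  a1f3  b8465b0
007   15c5  c3d53  3dc4  b20b  1c47  a1f3  b8465b0
019   15c5  52bd3  3dc4  b20b  1c47  a1f3  b8465b0

Note that each stored set begins with the record separator, so a blank line is printed before every set, and since "for (i in a)" walks the array in no particular order, the sets can appear in any order.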

Last update: 2023-08-19
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License