


Partial duplicates

In the business world, duplicate records are only rarely true duplicates. More often they're multiple records for the same customer (for example) with spelling differences in the customer's name, or differences in the formatting of the customer's address.

Finding these "pseudoduplicates" can be difficult. OpenRefine might help, and there are many commercial services which offer deduplication at a price.

Another way to view pseudoduplicates is to see them as partial duplicates. Instead of looking for parts of a record that might be the same as parts of another, you could screen for combined parts of a record that are the same, but perhaps shouldn't be.

Here's an example. The 19 records in the tab-separated table below (called "file") are each unique, even if we ignore the record ID field (see screenshot).

ID      Field1   Field2   Field3   Field4   Field5
10001   be       29b2     4aae     9e61     d4d019adef79
10002   7b       ea6a     426e     bae2     45ed2c358c6a
10003   39       cfd4     4318     875c     e7cbe71417f2
10004   42       ab7b     4947     96fd     1036c98a1235
10005   db       3f71     49a9     b24c     5a1fca079c31
10006   9f       aff0     4df5     8ea3     b5a092755912
10007   d3       1c64     4a57     baa0     68804191270a
10008   be       17c8     4d7e     97ee     ec55168faf14
10009   b6       4741     4e3d     b681     4fcbe71417f2
10010   b4       079e     47e8     be1e     0ea9999a8cd7
10011   9c       f8bf     4da7     a400     2519042a98df
10012   c8       8f21     4239     bfea     2b82d21a1cbd
10013   ae       aad7     4204     a351     c1851871300b
10014   cd       b571     49a9     bc6d     eb97b49401b9
10015   34       bad4     4318     8df9     abcbe71417f2
10016   af       c2d1     4dfd     a589     87edfbe7d380
10017   a8       3671     49a9     ba15     c21fca079c31
10018   ff       7871     49a9     b701     9e6a8470122b
10019   89       bbd4     4318     8f9d     60cbe71417f2

[screenshot: a check confirming that all 19 records are unique once the ID field is ignored]
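
A simple way to confirm the uniqueness is to drop the ID field and look for whole-record duplicates. Here's a sketch (not necessarily the exact command in the screenshot):

cut -f2- file | sort | uniq -d

An empty result means no two records agree in all of fields 1 to 5.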

I know, however, that there are partial duplicates lurking in this file, in the combined fields 3 and 5. The equivalent in a customer database might be "same address and phone number, but differently spelled name". To find these partial duplicates, I use one of two AWK commands. One reads the file twice, uses very little memory and is suited to files of any size. The other reads the file once; with big files it can use a lot of memory.

The two-pass command puts the field 3/field 5 combination in an array during the first pass through the file, and counts the number of occurrences of each combination. In the second pass, AWK matches the field 3/field 5 combination in the current line to the array, and prints that line if the array count is greater than 1:

awk -F"\t" 'FNR==NR {a[$3,$5]++; next} a[$3,$5]>1' file file

[screenshot: output of the two-pass command]
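
Spelled out with comments, the two-pass program does this (the same command as above, just reformatted for readability):

awk -F"\t" '
FNR==NR {        # true only while AWK reads the first copy of "file"
    a[$3,$5]++   # count each field 3/field 5 combination
    next         # skip the second block during the first pass
}
a[$3,$5]>1       # second pass: print any line whose combination was counted more than once
' file file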

A follow-up sort, here by the first checked field and then by record ID, organises the output more neatly:

awk -F"\t" 'FNR==NR {a[$3,$5]++; next} a[$3,$5]>1' file file \
| sort -t $'\t' -k3 -k1

[screenshot: sorted output of the two-pass command]

The one-pass AWK command puts every line into the array 'a', indexed by the field 3/field 5 combination. If the current line's fields 3 and 5 are already in 'a', the corresponding stored line is assigned to the variable 'b', and 'b' is printed followed by the line currently being checked. The next time there's a successful check of the array, 'b' is reassigned.

awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' file

[screenshot: output of the one-pass command]
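
Written out with comments, the one-pass program does this (again, the same command reformatted):

awk -F"\t" '
b=a[$3,$5] {          # assign any stored line to b; a non-empty b makes the pattern true
    print b "\n" $0   # print the stored earlier line, then the current line
}
{ a[$3,$5] = $0 }     # remember the current line under its field 3/field 5 combination
' file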

The output of the one-pass command needs uniquifying as well as sorting: when a combination appears more than twice, each "middle" record is printed twice, once as the current line and once as the stored line:

awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' file \
| sort -t $'\t' -k3 -k1 | uniq

[screenshot: sorted and uniquified output of the one-pass command]

In a real-world case recently, I had to modify the two-pass command because many of the records had blanks in the two fields being checked, which made them "the same". I checked for blanks after noticing that the count of partial duplicates was suspiciously large:

[screenshot: suspiciously large count of partial duplicates]
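
One way to get such a count is to pipe the finder command through wc. A sketch, assuming the checked fields in that dataset were the catalog number in field 2 and the collection code in field 17 (see the record details below):

awk -F"\t" 'FNR==NR {a[$2,$17]++; next} a[$2,$17]>1' file file | wc -l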

The modification was to add another condition to the second part of the AWK command, namely that field 2 isn't blank:

[screenshot: the modified command and the two pseudoduplicate records it found]
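
With the same assumed field numbers as in the sketch above, the modified command would look something like:

awk -F"\t" 'FNR==NR {a[$2,$17]++; next} a[$2,$17]>1 && $2 != ""' file file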

The two pseudoduplicated records have the same collection codes ("NM" in field 17) and catalog numbers ("E239" in field 2). Apart from different strings for the unique record ID (field 1) and the language of the institution name (field 16, "Národní muzeum" vs "National Museum of the Czech Republic"), the records are identical.

The two-pass and one-pass AWK commands make finding partial duplicates easy. The hard part of the job is deciding which fields to check!



To avoid having to write the two-pass command from scratch every time I use it, I've saved its skeleton in the function "coldupes", which uses the xclip utility:

# copy the two-pass skeleton to the X selection for middle-click pasting
coldupes() { echo -en "awk -F\"\\\t\" \x27FNR==NR {a[]++; next} \
a[]>1\x27 $1 $1 | sort -t \$\x27\\\t\x27 -k" | xclip; }

I enter "coldupes" and the filename in a terminal and press Enter. I then middle-click-paste at the next prompt to get the skeleton command, ready for adding the fields to check and the sort instructions:

[screenshot: coldupes in use in a terminal]
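
Entering "coldupes file", for example, loads this skeleton into the clipboard, with gaps in the two a[] indexes and after the final -k waiting to be filled:

awk -F"\t" 'FNR==NR {a[]++; next} a[]>1' file file | sort -t $'\t' -k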

Last update: 2018-07-14
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License