banner

For a full list of BASHing data blog posts, see the index page.     RSS


Drugs on the command line

There's a lot of raw material on the Web for data auditors to tinker with, but I've found only one website that advertises Datasets for data cleaning practice. It's a 2018 blog post by computational linguist Rachael Tatman. Among the offerings is a link to the National Drug Code Directory website of the US Food and Drug Administration, and one of the FDA downloadables there contains a table with 123,841 product records (2018-12-28 version).

The product table is plain text and tab-separated, but it's in windows-1252 encoding with a Windows carriage return at the end of each line. (Sigh.) I deleted the carriage returns and converted the table to UTF-8 as the file "prods0".

Tatman writes "Issue: Non-trivial duplication (which drugs are different names for the same things?)"

Answering Tatman's question isn't straightforward, because the product table contains partially duplicated records. Although each record has a unique product ID, if that ID is ignored there's a set of more than 1100 duplicates:

dupes1

Duplicate pair example from "prods0", with PRODUCTID in red:
 
0009-0039_5e394712-e775-435b-a4e0-32e1d9647ff5 0009-0039 HUMAN PRESCRIPTION DRUG SOLU-MEDROL methylprednisolone sodium succinate INJECTION, POWDER, FOR SOLUTION INTRAMUSCULAR; INTRAVENOUS 19590402 NDA NDA011856 Pharmacia and Upjohn Company LLC METHYLPREDNISOLONE SODIUM SUCCINATE 40 mg/mL Corticosteroid [EPC],Corticosteroid Hormone Receptor Agonists [MoA] N 20191231
 
0009-0039_95289567-4341-4b6c-bc3c-aa13036bc9b4 0009-0039 HUMAN PRESCRIPTION DRUG SOLU-MEDROL methylprednisolone sodium succinate INJECTION, POWDER, FOR SOLUTION INTRAMUSCULAR; INTRAVENOUS 19590402 NDA NDA011856 Pharmacia and Upjohn Company LLC METHYLPREDNISOLONE SODIUM SUCCINATE 40 mg/mL Corticosteroid [EPC],Corticosteroid Hormone Receptor Agonists [MoA] N 20191231

I cut away the unique ID and sorted and uniquified the records to build the file "prods1", retaining the header line:

cat <(cut -f1 --complement prods0 | head -n 1) <(tail -n +2 prods0 | cut -f1 --complement | sort | uniq) > prods1

Next, I focused in "prods1" on the fields SUBSTANCENAME (field 13), ACTIVE_NUMERATOR_STRENGTH (14) and ACTIVE_INGRED_UNIT (15). The FDA's explainer page describes these as follows:

SubstanceName
This is the active ingredient list. Each ingredient name is the preferred term of the UNII code submitted.
 
StrengthNumber [older field name?]
These are the strength values (to be used with units below) of each active ingredient, listed in the same order as the SubstanceName field above.
 
StrengthUnit [older field name?]
These are the units to be used with the strength values above, listed in the same order as the SubstanceName and SubstanceNumber.

If these 3 fields are the same, then the product is the same so far as the active ingredients are concerned. To find these partial duplicates I used the two-pass method described in a previous BASHing data post:

awk -F"\t" 'FNR==NR {a[$13,$14,$15]++; next} $13 != "" && $14 != "" && $15 != "" && a[$13,$14,$15]>1' prods1 prods1 | wc -l

dupes2

Wow! That's a lot of "same product" out of 123,205 unique product records. To investigate further I added the fields STARTMARKETINGDATE (field 8 in prods1), PROPRIETARYNAME (3), PROPRIETARYNAMESUFFIX (4) and LABELERNAME (12) to a print as the new file "prods2" (no header this time).

awk -F"\t" 'FNR==NR {a[$13,$14,$15]++; next} $13 != "" && $14 != "" && $15 != "" && a[$13,$14,$15]>1 {print $8 FS $3 FS $4 FS $12 FS $13 FS $14 FS $15}' prods1 prods1 > prods2
 
-----
 
StartMarketingDate
This is the date that the labeler indicates was the start of its marketing of the drug product.
 
ProprietaryName
Also known as the trade name. It is the name of the product chosen by the labeler.
 
ProprietaryNameSuffix
A suffix to the proprietary name, a value here should be appended to the ProprietaryName field to obtain the complete name of the product. This suffix is often used to distinguish characteristics of a product such as extended release (“XR”) or sleep aid (“PM”). Although many companies follow certain naming conventions for suffices, there is no recognized standard.
 
LabelerName
Name of Company corresponding to the labeler code segment of the ProductNDC.

That still apparently doesn't capture all the variation in FDA's database, because "prods2" contains a lot of exact duplicates

dupes3

and there don't seem to be any differences between the duplicated records in the original downloaded table ("prods0"), apart from the FDA product code and the unique ID based on that code:

Example from "prods0":
 
17518-080_7aa3171b-36c0-48d6-e053-2991aa0a6aec 17518-080 HUMAN OTC DRUG 3M SoluPrep chlorhexidine gluconate and isopropyl alcohol SOLUTION TOPICAL 20181008 NDA NDA208288 3M Company CHLORHEXIDINE GLUCONATE; ISOPROPYL ALCOHOL 20; .7 mg/mL; mL/mL N 20191231
 
17518-081_7aa3171b-36c0-48d6-e053-2991aa0a6aec 17518-081 HUMAN OTC DRUG 3M SoluPrep chlorhexidine gluconate and isopropyl alcohol SOLUTION TOPICAL 20181008 NDA NDA208288 3M Company CHLORHEXIDINE GLUCONATE; ISOPROPYL ALCOHOL 20; .7 mg/mL; mL/mL N 20191231

Once again I sorted and uniquified, converting "prods2" to "prods3", which has 92,452 records. One source of duplication in "prods3" is in the proprietary name suffix field, because the same basic product can be sold with slightly different formulations not affecting the active ingredients. Here's an example (from "prods0") — a dental fluoride paste that comes in 3 different flavours:

65222-401_6155acd9-8ec2-a87d-e053-2991aa0a7b43 65222-401 HUMAN PRESCRIPTION DRUG Nupro Fluorides NaF Oral Solution Mint Sodium Fluoride GEL DENTAL 19000101 UNAPPROVED DRUG OTHER Dentsply LLC. Professional Division Trading as "DENTSPLY Professional" SODIUM FLUORIDE 20 mg/g N 20191231
 
65222-411_6155acd9-8ec2-a87d-e053-2991aa0a7b43 65222-411 HUMAN PRESCRIPTION DRUG Nupro Fluorides NaF Oral Solution Mandarin Orange Sodium Fluoride GEL DENTAL 19000101 UNAPPROVED DRUG OTHER Dentsply LLC. Professional Division Trading as "DENTSPLY Professional" SODIUM FLUORIDE 20 mg/g N 20191231
 
65222-421_6155acd9-8ec2-a87d-e053-2991aa0a7b43 65222-421 HUMAN PRESCRIPTION DRUG Nupro Fluorides NaF Oral Solution Apple Cinnamon Sodium Fluoride GEL DENTAL 19000101 UNAPPROVED DRUG OTHER Dentsply LLC. Professional Division Trading as "DENTSPLY Professional" SODIUM FLUORIDE 20 mg/g N 20191231

A larger source of duplication in "prods3" is the marketing date. Walgreens, for instance, has 2 different registrations for an allergy medicine, differing only in start and end marketing dates (example from "prods0"):

0363-0211_997032ac-0110-4004-9697-d82146ba7128 0363-0211 HUMAN OTC DRUG 24 Hour Allergy Cetirizine HCl CAPSULE ORAL 20130301 NDA NDA022429 Walgreens CETIRIZINE HYDROCHLORIDE 10 mg/1 N 20181231
 
0363-1219_f3168f2c-27a7-4dd7-9770-e91443a580f1 0363-1219 HUMAN OTC DRUG 24 Hour Allergy Cetirizine HCl CAPSULE ORAL 20180914 NDA NDA022429 Walgreens CETIRIZINE HYDROCHLORIDE 10 mg/1 N 20191231

I generated "prods4" from "prods3" by cutting out marketing date and sorting and uniquifying again. That reduced the set of "basically the same product" records to 84,163. Here are the top 10 formulations:

cut -f4-6 prods4 | sort | uniq -c | sort -nr | head

dupes4

Those 25 mg lots of diphenhydramine hydrochloride (an antihistamine) were sold by a nominal 188 labelling entities, but again there's duplication. The FDA lists multiple strings for what's presumably the same company

Allergy relief     [no suffix]     Topco Associates LLC
Allergy relief     [no suffix]     Topco Associates, LLC
Allergy relief     [no suffix]     TopCo Associates LLC

the same product

Sleep Aid     Nighttime     CVS Pharmacy
Sleep- Aid     Nighttime     CVS Pharmacy
Sleep-Aid     Nighttime     CVS Pharmacy

or both

sleep aid     nighttime     Target Corporation
Sleep Aid     NightTime     TARGET Corporation

Summing up, the answer to Tatman's question about this dataset, namely "Which drugs are different names for the same things?" has several answers depending on how you define "things". But even after you've decided what you're looking for, the surprising messiness of the FDA's data means you have a lot of data cleaning to do before you can start looking. The FDA's product table is indeed a good dataset for data cleaning practice!


Some of the ingredient fields in the product table contain semicolon-and-space-separated strings, like

ACETALDEHYDE; ARSENIC TRIOXIDE; BALSAM PERU; OYSTER SHELL CALCIUM CARBONATE, CRUDE; PHENOL; CONIUM MACULATUM FLOWERING TOP; COUMARIN; SAFFRON; HISTAMINE DIHYDROCHLORIDE; LACHESIS MUTA VENOM; LYCOPODIUM CLAVATUM SPORE; PHOSPHORUS; SEPIA OFFICINALIS JUICE

Could there be additional duplication in these entries, with the same items listed in different orders in different records? Checking for different orders of items within a single field is an interesting exercise in data auditing: see the next BASHing data post.


Last update: 2019-01-06
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License