For a full list of BASHing data blog posts see the index page.     RSS

Checking DIY primary/foreign key relationships

The different tables in a (relational) database are linked by what are usually called primary keys and foreign keys.

In many of the datasets I audit, one of the tables is called event.txt and contains details of when, where, how and by whom some animal or plant was collected. There are no duplicate events in event.txt; each record is unique. A field called "eventID" contains a unique identifying code for each record. The "eventID" field is the primary key in the table event.txt.

Another table in the dataset is occurrence.txt. This table has details of the animal or plant, such as the name of the herbarium or museum where a specimen of the plant or animal is kept, together with its catalog number, scientific name and other facts about the specimen. occurrence.txt also has a primary key field, called "occurrenceID", which uniquely identifies each record in the table.

What occurrence.txt doesn't have is all the information about the collecting event. Instead, each record in occurrence.txt contains an "eventID" field which uses the unique identifying code from event.txt. The "eventID" field in occurrence.txt is a foreign key, and its presence means the two tables can be related by database software. For any occurrence in occurrence.txt, the software can look up in event.txt the event details represented by "eventID".

All of this is Databasing 101 stuff. Why am I explaining it here?

Because in some of the datasets I audit, the primary and foreign keys don't come from a database. They're heading to a database and were hand-built, often in Excel spreadsheets. Some of the resulting stuff-ups:

  1. "eventID" has duplicates in event.txt
  2. "occurrenceID" has duplicates in occurrence.txt
  3. "eventID" is different in form in event.txt and occurrence.txt
  4. occurrence.txt has "eventID" entries that aren't in event.txt
  5. some "eventID" entries in event.txt aren't found in occurrence.txt
    (they all should be, in these datasets)

The first two possibilities are easily checked with shell tools. Suppose "eventID" is field 2 in event.txt and "occurrenceID" is field 2 in occurrence.txt. If there are duplicates, this pair of commands will find them in tab-separated tables:

cut -f2 event.txt | sort | uniq -D
cut -f2 occurrence.txt | sort | uniq -D
(If the tables have a field separator other than a tab, it can be specified with cut's -d option.)

Let's suppose "eventID" is field 2 in event.txt and field 10 in occurrence.txt. The other three stuff-up possibilities can be checked with a single command:

paste <(cut -f2 event.txt | sort) <(cut -f10 occurrence.txt | sort | uniq) \
| awk -F"\t" '$1 != $2' | wc -l

This is an "Is everything OK?" check. If it returns "0", all is well. If it returns a non-zero number, there's a problem. Repeating the command and replacing the wc -l on the end with less will allow you to page through the mismatches between the "eventID" lists from event.txt and occurrence.txt. That should give you a clue as to where the problem lies, and form differences in "eventID" (problem 3) will be easily spotted.

I use an AWK array to look for problems 4 and 5. In the hypothetical example, remember that

The following commands assume the two tables are tab-separated (-F"\t"):

"eventID" in occurrence.txt that isn't in event.txt?
awk -F"\t" 'FNR==NR {a[$2]; next} !($10 in a) {print $2 FS $10}' \
event.txt occurrence.txt
"eventID" in event.txt that isn't in occurrence.txt?
awk -F"\t" 'FNR==NR {a[$10]; next} !($2 in a) {print $2}' \
occurrence.txt event.txt

I first noticed errors in primary/foreign key relationships in a set of CSV tables with many other data problems, and with a strong residual aroma of Excel (Windows line endings and extra commas). Since they're quick and easy to perform, the checks above have now become part of my auditing routine.

Do I trust data compilers to get their primary and foreign keys right? Nope.

Last update: 2020-09-02
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License