banner

For a full list of BASHing data blog posts see the index page.     RSS


There's data missing - please explain

Back in 2018 I blogged about NITS (Nothing Interesting To Say). NITS are the sometimes strange bits of text that people enter in databases when a data item is missing, like "???", "-" and "[ ]".

But the question

As a data enterer, what should I do about missing data?

doesn't have a simple answer, and the database software might only help a little. For example, MySQL users enter "NULL" whenever a data item doesn't have a known value. NULL gets special treatment in MySQL: it's ignored, or understood to equal FALSE. An empty string can also be a data item in MySQL, and it's treated differently from NULL. An empty string is known to be a blank, while NULL isn't known to be anything.

OK, that's NULL vs blank, but does blank really mean blank? Below are some possible origins of a blank value (empty string) for a data item. In each case, the data enterer has intentionally left the item blank.

  1. The value is known but hasn't been entered yet, for a legitimate reason
  2. The value is known to be unknown, and will never be known
  3. The value is not known at present, but might be in future
  4. There are several possible values, but the database only allows one value to be entered

If you're a database designer, you could argue that these cases could be eliminated if the fields were defined or organised differently. That might be true, but it doesn't help the data enterer faced with an existing database, or a data auditor like me, who only sees the data after entry.

My own preference is that missing data should always appear as a blank. Whatever the reason for the missingness, missingness is perfectly well represented by blankness. The alternative is to enter a text string that might be clearly understandable ("Unknown, can't be determined") but also might be ambiguous ("N/A"). Alternatives like these make data processing and data checking more difficult. They also run counter to the idea that a field should contain just one kind of information. In a date field, "unknown" isn't a date.

So how can a data enterer additionally give the reason for a blank? An all-purpose method is to use another field in the database that allows remarks about the field where blanks might appear. Here I might find (for the four possibilities listed above):

  1. Still on paper, will be entered soon
  2. Unknown, can't be determined
  3. Currently unknown, being investigated
  4. Known to be either A or B, not sure which

A choice like this is included in the Darwin Core system that is now widely used for organising and databasing biodiversity data. Darwin Core has fields for eventRemarks, georeferenceRemarks, identificationRemarks, locationRemarks, measurementRemarks, occurrenceRemarks, organismRemarks, relationshipRemarks and taxonRemarks. Most of these fields refer to more than one data field. "Remarks" fields like these allow a clear separation between data in the related fields and explanations for missing data in those fields.

Notice that I haven't recommended the RTFM solution, which is to ask data enterers to follow certain rules for data entries:

If you don't know the date, enter "unknown".
If the date was never recorded, enter "n.d.".
If the date is ambiguous, enter "uncertain".
If the date...

From experience, I know that entry rules don't work very well. Values other than the allowed ones get entered, and data enterers get confused about which rule to follow.

My advice that "missing data should always appear as a blank" obviously doesn't apply to databases with validation rules that don't allow blanks, or to databases with well-defined possible entries for missing data. I'm also aware that many people will disagree with me. For example, in an excellent recent overview of the databases in use in European museums and herbaria, the authors write:

There is a difference between information that is not known, information that is known to be unknown, and information that is known but not yet entered in digital form (digitized). Sometimes such cases can be determined by the use of values such as "unknown", "S.C.", (???), none, etc. but often these three cases cannot be distinguished from one another because relevant fields are left empty or are absent altogether. We recommend that best practice is to use a standardized value for unknown data for the case where information is known not to be known, and a value of 'empty' for the other two cases. Fields without any value should be interpreted as 'empty'.

This recommendation covers my first three numbered items (above). The authors suggest "unknown" for (2), which is "never will be known" and blank for (1) and (3), which are both "currently not known".

Item (4), an ambiguous entry, is an especially challenging explanation for a blank. Although some databases might be tweaked to allow alternative values for a data item, the data interchange formats used to move data from one database to another aren't so flexible. Markdown Syntax for Object Notation (MSON) might work for (4), but I haven't tried it.


Last update: 2021-06-30
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License