banner

For a full list of BASHing data blog posts see the index page.     RSS


How to be uncertain with dates

ISO 8601 is "an internationally accepted way to represent dates and times using numbers". The ISO 8601 date format YYYY-MM-DD is simple and has the blessing of (among other international bodies) the World Wide Web Consortium (W3C). ISO 8601 dates can be built and parsed in most programming languages and can be easily validated as regex or with AWK.

ISO 8601 dates have a limited amount of flexibility. If you only know the year and month, you can write YYYY-MM, and if you only know the year, YYYY. The three forms can all be used in interval dates, too, as start-date/end-date. For example, 2019-08-16/2019-10 means from 16 August 2019 to sometime in October 2019.

In real-world databases, ISO 8601 corruptions can be found when year, month or day aren't known for certain, as with YYYY-MM-00, YYYY-00-00 and 0000-00-00. These appear where the software user (or developer) wants the date to fit the YYYY-MM-DD pattern, even though there is no day "00", no month "00" and no year "0000" in the Gregorian calendar on which ISO 8601 is based.

I also see date strings like 1998-05? and ?1998-05 in my data auditing. These are usually examples of domain schizophrenia; the field is being used to contain more kinds of information than it is supposed to contain. It would be better to have 1998-05 in the date field and a separate field for date notes, with a comment like The year is uncertain or The month is uncertain or The month and year are uncertain.

To deal with date uncertainties (and other date/time issues), the International Organization for Standardization issued extensions to ISO 8601 in 2019. The documentation is locked behind a very high paywall, but a draft version was freely available as a PDF in 2016.

Helpfully, the relevant date features were published online by the US Library of Congress in 2019 as their Extended Date Time Format (EDTF, here), and it's the EDTF that I've noticed being implemented in a digital data repository and a Python library.

What follows is an overview of some of the EDTF features. These are divided into categories of uncertainty and further divided into the EDTF's two levels of implementation, "Level 1" and "Level 2". If you aren't using the new extensions you're at "Level 0".


Uncertain ("?"). At Level 1, a "?" at the end of a date indicates that the whole of the date may be correct, but it's not definitely correct.
    2015-06? means possibly June 2015 but that's not certain.

At Level 2, a "?" at the right of a date component indicates that everything to the left is uncertain.
    2015-06?-14 means the day is the 14th, but "June 2015" isn't certain.

At Level 2, a "?" at the left of a date component indicates that the component is uncertain.
    2015-?06-14 means "2015" and "the 14th" are certain, but "June" is uncertain.


Approximate ("~"). At Level 1, a "~" at the end of a date indicates that the whole of the date is approximate.
    2015-06-14~ means approximately 14 June 2015.

At Level 2, a "~" at the right of a date component indicates that everything to the left is approximate.
    2015-06~-14 means the day is the 14th, but "June 2015" is approximate.

At Level 2, a "~" at the left of a date component indicates that the component is approximate.
    2015-~06-14 means "2015" and "the 14th" are certain, but "June" is approximate.


Approximate and uncertain ("%"). At Level 1, a "%" at the end of a date indicates that the whole of the date is both approximate and uncertain.
    2015-06% means possibly June 2015 but that's approximate and not certain.

At Level 2, a "%" at the right of a date component indicates that everything to the left is both approximate and uncertain.
    2015-06%-14 means the day is the 14th, but "June 2015" is approximate and uncertain.

At Level 2, a "%" at the left of a date component indicates that the component is both approximate and uncertain.
    2015-%06-14 means "2015" and "the 14th" are certain, but "June" is approximate and uncertain.


Unspecified ("X"). At Level 1, putting "X" in particular positions means that the digit expected at that position hasn't been specified. I'll quote from the Library of Congress EDTF website:

  1. A year with one or two (rightmost) unspecified digits in a year-only expression (year precision)
    Example 1       '201X'
    Example 2       '20XX'
  2. Year specified, month unspecified in a year-month expression (month precision)
    Example 3       '2004-XX'
  3. Year and month specified, day unspecified in a year-month-day expression (day precision)
    Example 4       '1985-04-XX'
  4. Year specified, day and month unspecified in a year-month-day expression  (day precision)
    Example 5       '1985-XX-XX'

At Level 2, the "X" can make any digit unspecified. Quoting again:


Combinations. The uncertainty characters (? ~ % X) can be combined at Level 2. The 2016 ISO draft gives the example 2004?-06-~11 = ?2004-06-~11, which in both cases means Year uncertain, month known, day approximate.


Interval uncertainty. At Level 1, there are two ways to show uncertainty in a start-date/end-date expression. A right-slash ("/") is an interval marker, so leaving one side of the marker blank indicates that the blank side isn't known.
    2015-06-14/ means Interval started 14 June 2015, end unknown
    /2015-06-14 means Interval ended 14 June 2015, start unknown

Also at Level 1, a double-period ("..") shows that the start or end of the interval is either open or unspecified.
    2015-06-14/.. means Interval started 14 June 2015, end not specified
    ../2015-06-14 means Interval ended 14 June 2015, start not specified

At Level 2, you can add to an interval any of the uncertainty characters (? ~ % X), with their usual meanings, so 2015-?06-XX/2015-08?-23 means something like Started in 2015 on an unspecified day possibly in June, and ended on the 23rd, possibly but not definitely in August 2015


Speaking as a member of the Data Workers Union (Boring Old Fart Division), I'd say the ISO 8601 date format extensions are impressive but unlikely to be widely adopted. You'd need a cheat sheet or a carefully written parser to translate them, as well a metadata explainer for your data compilations (e.g., ISO 8601 date entries follow the 2019 extensions at Level 2).

X for unspecified is a useful idea, though. It could replace, for example, YYYY-MM-00 in a database with the easily parseable and a little more understandable YYYY-MM-XX. But I still like the idea of a separate date-notes field to explain uncertainties:

Date: 2015-XX-XX/XXXX-XX-23
Date-notes: According to Smith the program started in 2015 on an unspecified day possibly in June, and according to Jones it ended on the 23rd, possibly in August 2015


Last update: 2020-02-12
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License