For a full list of BASHing data blog posts, see the index page.     RSS

How to validate ISO 8601 dates without regex

Dates in ISO 8601 format have several advantages over dates in older formats, but it's still possible to get an ISO 8601 date wrong at the data entry stage. This post covers several ways to check ISO 8601 calendar dates with AWK for format and content errors.

ISO 8601 calendar dates have the forms

Any of these three forms can also be used in a calendar date interval. A commonly used ISO 8601 interval is "start/end", for example "1999-08-15/2000-05" for "15 August 1999 to May 2000".

The tab-separated data table being checked ("ypm") in what follows has 311484 records and is from the Peabody Museum of Natural History at Yale University, New Haven, Connecticut, USA. The key field for checking is eventDate (field 30), whose entries are either blank, or ISO 8601 calendar dates and "start/end" intervals. I'll be identifying individual records using their entries in the catalogNumber field (field 16).

Check the components. AWK can split a string, then split each of the resulting pieces. The "split" function takes as arguments the string to be split, the array into which the pieces will be put, and the character or characters on which to do the splitting. The string to be split can be an array item from a previous splitting.

I use an AWK command to first split an eventDate into the piece before a "/" ("a[1]" in array "a"), and the piece after a "/" ("a[2]" in array "a"). The command then splits "a[1]" into the components separated by a "-", namely "b[1]", "b[2]" and "b[3]" in array "b". The command then repeats this process with "a[2]", generating "c[1]", "c[2]" and "c[3]" in array "c". If the date isn't an interval or some components are missing (as in a YYYY-only entry), the missing components are empty strings in the arrays.

To check for invalid components, I need to specify conditions. I'll ignore the header line and any records in which eventDate is blank. I'll test each of the split-out pieces, but only after ensuring that the piece isn't blank. The test for the year pieces (b[1] and c[1]) is that the year is less than 1600 or greater than 2018 (in this particular dataset). The test for the month pieces (b[2] and c[2]) is that the entry is less than 1 or greater than 12. The test for the day pieces (b[3] and c[3]) is that the entry is less than 1 or greater than 31. If I find an invalid entry, I'll print out its value and the corresponding catalog number. Finally, I'll sort the results by catalog number.

Spread across several lines, the command looks like this:

awk -F"\t" 'NR>1 && $30 != "" \
{split($30,a,"/"); split(a[1],b,"-"); split(a[2],c,"-"); \
if (b[1]<1600 || b[1]>2018 || (c[1] != "" && (c[1]<1600 || c[1]>2018)) \
|| (b[2] != "" && (b[2]<1 || b[2]>12)) || (c[2] != "" && (c[2]<1 || c[2]>12)) \
|| (b[3] != "" && (b[3]<1 || b[3]>31)) || (c[3] != "" && (c[3]<1 || c[3]>31))) \
print $16 FS $30}' ypm | sort

And here's the result (5 seconds processing time for 311484 records on a core i5 desktop, 8GB RAM):


Notice that I ignored the fact that many of the array strings have a leading zero. That's because AWK's "greater than/less than" operators ignore any leading zeroes, as shown below:


Calendar peculiarities. Is there a 31 April in the dataset? is another test for validity. It can be asked with AWK simply by searching for invalid entries, as in the example below, without having to split the date string:


The full check for month peculiarities (below) shows that eventDate in the "ypm" table has no 30 or 31 February and no 31st day of April, June, September or November.


Another month check is for 29 February. This leap year day occurs in every year evenly divisible by 4, except century years not evenly divisible by 400. Leap year days occur only rarely in the datasets I audit, so I'll just look for all the unique eventDate entries containing 29 February, then inspect their year numbers:


All good.

Interval logic. The last check looks for "start/end" intervals in which the end date is earlier than the start date. In AWK, the "greater than" (">") and "less than" ("<") operators are also "later than" and "earlier than" operators for ISO 8601 dates, as shown below:


The interval logic test compares the two parts of each interval date after splitting on the "/" character. First I'll count the number of "start/end" intervals in the right order, then the ones in the wrong order:

awk -F"\t" '$30 ~ /\// {split($30,a,"/"); if (a[1]<a[2]) print $16 FS $30}' ypm | wc -l
awk -F"\t" '$30 ~ /\// {split($30,a,"/"); if (a[1]>a[2]) print $16 FS $30}' ypm | wc -l


45 in reversed order is quite a few, so I'll save those records to a file, "reversed":

awk -F"\t" '$30 ~ /\// {split($30,a,"/"); if (a[1]>a[2]) print $16 FS $30}' ypm | sort > reversed

The contents of "reversed":


Last update: 2018-10-05
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License