Taxa, places and dates

Taxa
Places
Dates

Elsewhere on this website are methods for checking for invalid and incorrectly formatted entries in fields, and for disagreements between fields. Below are some notes on what I see as good Darwin Core practice. The opinions expressed are my own and may not be shared by Darwin Core maintainers or GBIF staff.


Taxa

The key taxon fields are scientificName and taxonRank. Note that scientificName should only contain formal and correctly formatted names, and no qualifiers. Other forms in the original record can go in verbatimIdentification and qualifiers in identificationQualifier. The current taxon rank may be different from the one in the original record, and Darwin Core has a verbatimTaxonRank field for that original ranking.

Authorship can be included in scientificName (good idea!) but if it isn't, then authorship can go in a scientificNameAuthorship field, and it should be formatted there exactly as it would be if it was in scientificName.

It is helpful for data searching if scientificName is "decomposed" into genus, specificEpithet and (if appropriate) infraspecificEpithet and cultivarEpithet fields. Darwin Core also has fields for taxon names between genus and species epithet. These are subgenus and infragenericEpithet. If an animal is classified as "Hamus (Notohamus) vulgaris", then subgenus should have "Hamus (Notohamus)" and infragenericEpithet should have "Notohamus". Botanical sections can also appear in infragenericEpithet; for the scientificName "Vicia sect. Cracca", infragenericEpithet can be "Cracca".

It is also helpful for data searching if the higher-taxon classification of scientificName is given in the available fields. I would recommend always having kingdom, phylum, class, order and family, but Darwin Core also has superfamily, subfamily, tribe and subtribe fields. The higherClassification field can be used for additional categories (like infraclass).

scientificName contains the name used by the identifier or in an original record, but the people preparing the Darwin Core dataset (GBIF calls them "content providers") may have a different idea about what the currently accepted name should be. That name should go in acceptedNameUsage along with an entry in taxonomicStatus that explains what the scientificName entry is, if it's different from the acceptedNameUsage entry.

Darwin Core recently added another field, genericName, for cases where a species has moved between genera. The example given in the Darwin Core guide looks like this:

scientificName = Felis concolor
acceptedNameUsage = Puma concolor
genus = Puma
genericName = Felis
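
As a rough sketch of a between-field check for this case: if genericName is filled in, it would normally be the first word of scientificName. The function name chkgeneric and the field numbers passed to it are assumptions for illustration, and the check will need adjusting for hybrid markers and other unusual name formats.

```shell
# chkgeneric (hypothetical name): list records where genericName is filled in
# but is not the first word of scientificName.
# Arguments: tab-separated file with a header line, then the field numbers of
# scientificName and genericName.
chkgeneric() {
    awk -F"\t" -v sn="$2" -v gn="$3" '
        NR > 1 && $gn != "" {
            split($sn, w, " ")      # w[1] is the first word of scientificName
            if (w[1] != $gn) print NR FS $sn FS $gn
        }' "$1"
}
```

The output is the line number, the scientificName entry and the genericName entry for each disagreement.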

There are more taxon fields in Darwin Core (see https://dwc.tdwg.org/terms/#taxon) and these can be helpful when building a checklist dataset. They are not normally used for occurrence datasets.


Places

If you have already used within-field checks and between-field checks on place fields, then it's worth looking more closely at coordinates.

There are four key place fields: decimalLatitude, decimalLongitude, geodeticDatum and coordinateUncertaintyInMeters. Together these define where an observation was made or a sample collected. If there is only latitude and longitude for an occurrence, the location data are incomplete.

If you leave out a geodetic datum, GBIF will assume that your coordinates are based on the WGS84 datum, even if they aren't. The "WGS84" is just a guess, and the difference between the point with WGS84 and the point with another datum could be several hundred metres or more. If decimalLatitude and decimalLongitude have entries, then geodeticDatum should also have an entry, even if that entry is "unknown".
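
A minimal sketch of a check for this, assuming a tab-separated table with a header line. The function name chkdatum and the field numbers passed to it are illustrative assumptions, not part of any existing toolkit.

```shell
# chkdatum (hypothetical name): list records that have coordinates but a
# blank geodeticDatum.
# Arguments: tab-separated file with a header line, then the field numbers of
# decimalLatitude, decimalLongitude and geodeticDatum.
chkdatum() {
    awk -F"\t" -v lat="$2" -v lon="$3" -v datum="$4" '
        NR > 1 && $lat != "" && $lon != "" && $datum == "" {
            print NR FS $lat FS $lon    # line number, latitude, longitude
        }' "$1"
}
```

An entry of "unknown" in geodeticDatum is not blank, so records with that entry are not listed.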

The other item that should not be missing is coordinateUncertaintyInMeters (cUIM). cUIM is the radius of the smallest circle (around the point specified by the latitude and longitude) in which the observation was made or the sample collected. This is the so-called "point-radius" method for defining a location.

Latitude/longitude coordinates specify a point, but that point does not have a size. It's infinitesimally small. No one can safely assume that the coordinates actually mean "about here, or maybe plus or minus 50 metres", or "this was the GPS reading somewhere on our big sampling plot".

If there is no cUIM, then we do not know how close the observation or collection was to the latitude/longitude point. 10 metres? 100? 1000? 10000? 100000? With a cUIM, the location expands from a point to a circle, and the data say that the occurrence was definitely within that circle.

Unfortunately, many Darwin Core datasets do not have cUIM entries for their coordinates. They should, because uncertainty really matters when GBIF data are used for research purposes. (See Marcer et al. 2020; Marcer et al. 2022.)
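
To see how large the gap is in a particular dataset, something like the following could be used. It is only a sketch: the function name cuimgap is an assumption, and the table is assumed to be tab-separated with a header line.

```shell
# cuimgap (hypothetical name): count the records with coordinates, and how
# many of those have a blank coordinateUncertaintyInMeters.
# Arguments: tab-separated file with a header line, then the field numbers of
# decimalLatitude, decimalLongitude and coordinateUncertaintyInMeters.
cuimgap() {
    awk -F"\t" -v lat="$2" -v lon="$3" -v cuim="$4" '
        NR > 1 && $lat != "" && $lon != "" {
            coords++
            if ($cuim == "") missing++
        }
        END { printf "%d with coordinates, %d without cUIM\n", coords, missing }' "$1"
}
```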

There is an excellent online resource from GBIF (Chapman and Wieczorek 2020) that describes in detail how to estimate cUIM when the location information comes from a specimen label or a spot on a map. For some other suggestions, see this GBIF forum post.

A Darwin Core place field that data compilers sometimes get wrong is coordinatePrecision. This is a decimal representation of the number of decimal places used in decimalLatitude and decimalLongitude.

However, that is not the whole story. If decimalLatitude and decimalLongitude are -43.2228 and 145.6008, then coordinatePrecision is 0.0001. If decimalLatitude and decimalLongitude are -43.2000 and 145.6000, then coordinatePrecision is 0.1, not 0.0001.

A further complication appears when decimalLatitude and decimalLongitude have been calculated from non-decimal coordinates. For example, if verbatimLatitude is 43°13'22"S, then decimalLatitude is -43.2228 but coordinatePrecision is 0.000278, not 0.0001, because the latitude was originally recorded to the nearest second in degree-minutes-seconds format. See https://dwc.tdwg.org/terms/#dwc:coordinatePrecision and https://docs.gbif.org/georeferencing-best-practices/1.0/en/ for more information about conversions, and if you are still uncertain about coordinatePrecision, do not include it in your Darwin Core table.
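
The arithmetic behind those numbers can be reproduced on the command line. This is a worked example only; the function name dms2dec is an assumption, and it rounds to four decimal places to match the example above.

```shell
# dms2dec (hypothetical name): convert degrees, minutes, seconds and a
# hemisphere letter to decimal degrees, rounded to 4 decimal places.
dms2dec() {
    awk -v d="$1" -v m="$2" -v s="$3" -v h="$4" 'BEGIN {
        dec = d + m/60 + s/3600
        if (h == "S" || h == "W") dec = -dec    # south and west are negative
        printf "%.4f\n", dec
    }'
}

dms2dec 43 13 22 S                          # -43.2228
awk 'BEGIN { printf "%.6f\n", 1/3600 }'     # one second of arc in degrees: 0.000278
awk 'BEGIN { printf "%.5f\n", 1/60 }'       # one minute of arc in degrees: 0.01667
```

The last line shows the value that would apply if the original latitude had only been recorded to the nearest minute.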

A more basic question is: how many decimal places should there be in decimalLatitude and decimalLongitude? The answer is: not more than you can justify (see cartoon, below, from xkcd). A good guide to number of decimal places is in Wikipedia.

I am always surprised to see more than five decimal places in Darwin Core coordinates, and I recommend reading this GBIF forum post before copying long coordinate numbers from a spreadsheet or GIS program and pasting them into a Darwin Core dataset.
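
One way to find out how many decimal places a dataset actually uses is to tally them. The function below is a sketch under the assumption of a tab-separated table with a header line; the name decplaces is hypothetical.

```shell
# decplaces (hypothetical name): tally the number of decimal places in the
# entries of one field.
# Arguments: tab-separated file with a header line, then the field number.
decplaces() {
    awk -F"\t" -v f="$2" '
        NR > 1 && $f != "" {
            n = index($f, ".")                       # position of the decimal point, 0 if none
            places = (n == 0) ? 0 : length($f) - n   # number of characters after the point
            count[places]++
        }
        END { for (p in count) print p FS count[p] }' "$1" | sort -n
}
```

Each output line has a number of decimal places and the count of entries with that many places, so a long tail of 10- or 14-place coordinates is easy to spot.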

xkcd cartoon

Dates

Dates and times in Darwin Core tables should always be in ISO 8601 format, and "00", "XX", "??" etc are not valid elements in ISO 8601 dates.
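
A rough filter for those placeholders might look like the sketch below. It is not a full ISO 8601 validator, and the function name baddate is an assumption. It reads date strings on standard input, for example the eventDate entries cut from a table.

```shell
# baddate (hypothetical name): read date strings on standard input and print
# any that contain a "00" month or day, or an "XX" or "??" placeholder.
# A "00" in a time (like T00:00) or in a year (like 2000) is not flagged.
baddate() {
    grep -E '[-/]00([^0-9:]|$)|XX|\?\?'
}
```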

Use the interval date format for an interval. If samples were collected from 1 to 7 November 2013, then eventDate should be 2013-11-01/07 (or 2013-11-01/2013-11-07), not 2013-11-01 (the start date), 2013-11-04 (the midpoint date) or 2013-11-07 (the end date).

Please also note that if you are including a time in eventDate, it should be separated from date with a "T", like this for 25 minutes after 6 in the evening on 24 June 2023, UTC time: "2023-06-24T18:25Z". If you are not familiar with UTC and time-zone designations, please see https://en.wikipedia.org/wiki/ISO_8601 and https://dwc.tdwg.org/terms/#dwc:eventTime.

A useful function is dates:

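dates() {
# Reads dates on standard input and prints any with an impossible month-day combination, or with 29 February
grep -E '02-30|02-31|04-31|06-31|09-31|11-31|02-29'
}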

Assuming all the years in eventDate and year are OK, and the months in eventDate and month are in the range 1-12, and the days in eventDate and day are in the range 1-31, then "dates" will find any entries with the invalid entries YYYY-02-30, YYYY-02-31 etc. It will also return dates with YYYY-02-29 (leap year 29 February); in this case the years should be divisible by 4, but note that 1700, 1800 and 1900 did not have a 29 February in the Gregorian calendar.

I use dates like this: tally [filename] [eventDate field] | cut -f2 | dates, and if any invalid dates appear I search for them with grep, AWK (to also return a line number or ID) or a text editor.
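
The 29 February entries can be narrowed down further by applying the Gregorian leap-year rule (divisible by 4, except century years not divisible by 400). The following is a sketch only; the name notleap is an assumption, and only the first 29 February in each entry is tested.

```shell
# notleap (hypothetical name): read date strings on standard input and print
# those with a 29 February in a year that is not a Gregorian leap year.
notleap() {
    awk 'match($0, /[0-9][0-9][0-9][0-9]-02-29/) {
            y = substr($0, RSTART, 4) + 0    # the 4-digit year before -02-29
            if (y % 4 != 0 || (y % 100 == 0 && y % 400 != 0)) print
        }'
}
```

It can be placed at the end of the same kind of pipeline, after the date entries have been extracted from the table.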

startDayOfYear and endDayOfYear are not often included in Darwin Core datasets. This may be because their use is confusing, and because an interval date in eventDate, like "2023-06-24/28" already contains a start day and an end day, and the day numbers can be looked up if needed (e.g. at https://landweb.modaps.eosdis.nasa.gov/browse/calendar.html).
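
Day numbers can also be calculated rather than looked up. The sketch below uses plain AWK arithmetic and assumes a complete YYYY-MM-DD date in the Gregorian calendar; the function name doy is an assumption.

```shell
# doy (hypothetical name): print the day-of-year number for a YYYY-MM-DD date.
doy() {
    awk -v d="$1" 'BEGIN {
        split(d, a, "-")
        y = a[1] + 0; m = a[2] + 0; day = a[3] + 0
        # days elapsed before the first of each month in a non-leap year
        split("0 31 59 90 120 151 181 212 243 273 304 334", before, " ")
        leap = (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0))
        print before[m] + day + ((leap && m > 2) ? 1 : 0)
    }'
}

doy 2023-06-24    # 175
doy 2023-06-28    # 179
```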

You can check on the command line if an sDOY or an eDOY agrees with an ISO 8601 date. Use this function:

chkday() {
# Needs an AWK with mktime() and strftime(), such as GNU AWK (gawk)
awk -F"\t" -v isodate="$2" -v dayno="$3" 'NR>1 && $isodate != "" && $dayno != "" {split($isodate,a,"-"); b=strftime("%j",mktime(a[1]" "a[2]" "a[3]" 0 0 0")); if (b != sprintf("%03d",$dayno)) print $isodate FS $dayno FS b}' "$1"
}

chkday takes three arguments: the filename, the field number of the ISO 8601 date and the field number of the day number (sDOY or eDOY). If there are disagreements, it returns the ISO 8601 date, the supplied day number and the correct day number. In the screenshot below, the last 10 disagreements are shown for a check in which chkday tested the file "dmns", which had an ISO 8601 date in field 28 and an sDOY (no eDOY field) in field 29.

chkday