Between tables

event.txt and occurrence.txt
Cross-table date checking
Linking tables


event.txt and occurrence.txt

Events and occurrences in Darwin Core are usually one-to-many. In other words, every unique event (date, place, collector, method) will generate several occurrence records, with different species collected or observed at the same place, same time, same collector/observer, same method.

An event.txt table with its event details is linked to an occurrence.txt table by an eventID field. Unfortunately, it often happens that the eventID in an occurrence.txt record is not matched by an eventID in event.txt. This means that the tables cannot be properly joined (technically, a failure in "referential integrity") and GBIF will ignore any records with an unmatched eventID.

To check for this problem, use the chkevoc script. It works on an event.txt file and an occurrence.txt file (with exactly those file names) in the same directory. The field numbers don't matter, as the script will find the fields. In addition to checking to see if every occurrence eventID has a corresponding eventID in the events file, the script also checks for missing and duplicated eventID entries in event.txt and missing eventID entries and missing or duplicated occurrenceID entries in occurrence.txt.

In the screenshot below, chkevoc has found a large number of referential integrity failures:

chevoc

Often this failure happens because the two tables (event.txt and occurrence.txt) were compiled separately (in a spreadsheet) rather than in database software. Often the eventID missing in event.txt is "sort of" there, but formatted differently, for example, "3c3b_2012-09-20_box-corer" in event.txt but "3c3b_2012-09-20_box_corer" in occurrence.txt. They might look similar to a human, but they behave differently in a data-processing program.

Many Darwin Core table compilers don't split their records between an event.txt and an occurrence.txt file, instead putting all the event details in occurrence.txt, and there is nothing wrong with doing this.

However, I have seen datasets where some fields in event.txt have been repeated in occurrence.txt. For example, both tables might have eventID and eventDate fields. In a few cases, however, a check showed that in some occurrence.txt records eventDate was different for the same eventID, which is a data error.

These "hybrid" occurrence.txt tables are not safe to create. A GBIF staff member recently commented that GBIF would prefer that data compilers "use an event.txt file for all event-field information and include only eventID in occurrence.txt" (https://github.com/gbif/portal-feedback/issues/5024#issuecomment-1788877349).


Cross-table date checking

There are various ways in which an event.txt and an occurrence.txt table could disagree, but it isn't worth trying to imagine all of them! One disagreement worth checking is this: is eventDate in event.txt later than dateIdentified in occurrence.txt, for the same eventID that links the two tables?

The easiest way to do this check is to first create a new, combined table with this command:

awk -F"\t" 'FNR==NR {a[$eventID]=$eventDate; next} $eventID in a {print $occurrenceID FS $eventID FS a[$eventID] FS $dateIdentified}' event.txt occurrence.txt

The output is a TSV with the 4 fields

  • occurrenceID (from occurrence.txt)
  • eventID (from both event.txt and occurrence.txt)
  • eventDate (from event.txt)
  • dateIdentified (from occurrence.txt)

Next, pipe the output to this function:

2dates2files() {
awk -F"\t" 'BEGIN {print "occurrenceID\teventID\teventDate\tdateIdentified"} NR>1 && $3 != "" && $4 != "" {split($3,a,"-"); split($4,b,"-"); if ((a[1] > b[1]) || (a[2] && b[2] && a[1]"-"a[2] > b[1]"-"b[2]) || (a[3] && b[3] && a[1]"-"a[2]"-"a[3] > b[1]"-"b[2]"-"b[3])) print}'
}

This disagreement does indeed appear in real-world Darwin Core data, just as it sometimes happens when eventDate and dateIdentified are together in a single occurrence.txt table. For example, in a Darwin Core dataset I audited, in event.txt eventID was field 2 and eventDate was field 5, and in occurrence.txt occurrenceID was field 10, eventID was field 16 and dateIdentified was field 18:

cross dates

If there are no disagreements, 2dates2files will just print the header line.


Linking tables

Darwin Core allows for more than just an occurrences table (occurrence.txt). There are also events tables (event.txt), measurement tables (measurementorfact.txt), taxon information tables (taxon.txt) and others (see https://rs.gbif.org/extensions.html). All such tables can be packed into a single ZIP archive for sharing (for example, with GBIF).

The archive is based on the so-called "star" schema where there is a core table and one or more other tables that are linked to it. The linking is done by including a relevant "ID" field in each table. Please do not rely on an id field generated by IPT software, which may or may not allow linking by the user's processing software. Below are examples of linking "ID" fields. If these are missing in either table, they should be added.

If there are event.txt and occurrence.txt tables, the linking field in the two tables is eventID.
 
If there is a measurementorfact.txt table with data that refer to particular occurrences in occurrence.txt, then the linking field in both tables is occurrenceID (see also below).
 
If there is a taxon.txt table with data about taxa reported in occurrence.txt, then the linking field in both tables is taxonID. Checklist data in Darwin Core format usually have a taxon.txt table as the core, with "star" extensions like references.txt linked to taxon.txt by a taxonID field.

In the same measurementorfact.txt table you can have measurement data for events (like temperature and depth) and also measurement data about the samples recovered in those events (like fish length and weight). The table should have both an eventID field (to identify events that had event measurements) and an occurrenceID field (to identify occurrences that had occurrence measurements). Event measurement records would have occurrenceID blank.

This combining of linkages is logical and allows a user to extract event and occurrence measurements separately. There can also be a separate (or only) parentEventID field for parent event measurement data.