Special topics
Hybrid occurrence tables
Best practice with occurrence data is either to include all the event data items in each occurrence record, or to separate out the event data items in an event.txt table, and replace them in occurrence.txt with a corresponding eventID entry.
This doesn't always happen. Some Darwin Core datasets have both an event.txt table and an occurrence.txt table, with one or more event fields (such as eventDate) repeated in the occurrences table.
Are the repeated data items the same in both tables for the same eventID? There isn't an easy check to answer this question, but here I'll demonstrate one such check with a pair of fictitious data tables (below). In both tables there are the event data fields decimalLatitude, decimalLongitude, eventDate and recordedBy fields. To save space I've abbreviated some field names, e.g. "decLat".
event.txt:
id | eventID | eventDate | decLat | decLon | recBy |
bos1 | bos1 | 2025-03-16 | -24.7273 | 135.6194 | B. Tor |
bos2 | bos2 | 2025-03-16 | -24.7268 | 135.6201 | B. Tor |
bos3 | bos3 | 2025-03-17 | -24.9113 | 135.6128 | B. Tor | C. Ham |
bos4 | bos4 | 2025-03-18 | -24.9106 | 135.6121 | B. Tor | C. Ham |
bos5 | bos5 | 2025-03-19 | -24.9097 | 135.6099 | B. Tor | C. Ham |
occurrence.txt:
id | occID | sciName | eventID | decLat | decLon | eventDate | recBy | indCount |
pog001 | pog001 | Pogona vitticeps | bos1 | -24.7273 | 135.6194 | 2025-03-16 | B. Tor | 1 |
pog002 | pog002 | Varanus giganteus | bos1 | -24.7273 | 135.6194 | 2025-03-16 | B. Tor | 1 |
pog003 | pog003 | Pogona vitticeps | bos2 | -24.7268 | 135.6201 | 2025-03-16 | B. Tor | 2 |
pog004 | pog004 | Varanus giganteus | bos2 | -24.7268 | 135.6201 | 2025-03-16 | B. Tor | 1 |
pog005 | pog005 | Pogona vitticeps | bos3 | -24.9113 | 135.6128 | 2025-03-17 | B. Tor | C. Ham | 1 |
pog006 | pog006 | Varanus giganteus | bos3 | -24.9113 | 135.6128 | 2025-03-17 | B. Tor | C. Ham | 1 |
pog007 | pog007 | Suta suta | bos3 | -24.9113 | 135.6128 | 2025-03-17 | B. Tor | C. Ham | 3 |
pog008 | pog008 | Pogona vitticeps | bos4 | -24.9106 | 135.6121 | 2025-03-18 | B. Tor | C. Ham | 2 |
pog009 | pog009 | Varanus giganteus | bos4 | -24.9016 | 135.6121 | 2025-03-18 | B. Tor | C. Ham | 1 |
pog010 | pog010 | Suta suta | bos4 | -24.9106 | 135.6121 | 2025-03-18 | B. Tor | C. Ham | 1 |
pog011 | pog011 | Aspidites ramsayi | bos4 | -24.9106 | 135.6121 | 2025-03-18 | B. Tor | 1 |
pog012 | pog012 | Varanus giganteus | bos5 | -24.9097 | 135.6099 | 2025-03-19 | B. Tor | C. Ham | 1 |
The first step in the check is to build a concordance that gives field numbers for the repeated fields (including eventID), first in event.txt, then in occurrence.txt. Note that I've excluded the id field (field 1), since if it's present it will presumably always be different between the two tables:
awk -F"\t" 'ARGIND==1 && FNR==1 {for (i=2;i<=NF;i++) a[$i]=i} ARGIND==2 && FNR==1 {for (j=2;j<=NF;j++) if ($j in a) print $j FS a[$j] FS j}' event.txt occurrence.txt
Step 2 is to build a tab-separated, combined events+occurences table with just the shared fields plus occurrenceID. Using AWK, this table-building has to be done using the concordance results, and since these can vary from dataset to dataset, the following command is "ad hoc" for my demonstration tables:
awk -F"\t" 'FNR==NR {a[$2]=$2 FS $4 FS $5 FS $3 FS $6; next} $4 in a {print a[$4] FS $2 FS $5 FS $6 FS $7 FS $8}' event.txt occurrence.txt
The third and final step is to work through the combined table and identify data items that differ for the same field and the same eventID. The AWK command that does this is again "ad hoc" because it depends on the placement of fields in the combined table. Since this third command works on the combined table, the step 2 and 3 commands are chained together. Again, I've abbreviated to save space:
awk -F"\t" 'FNR==NR {a[$2]=$2 FS $4 FS $5 FS $3 FS $6; next} $4 in a {print a[$4] FS $2 FS $5 FS $6 FS $7 FS $8}' event.txt occurrence.txt | awk -F"\t" 'BEGIN {print "eventID\toccurID\tfield\tev.txt\toc.txt"} NR==1 {for (i=1;i<=NF;i++) b[i]=$i} NR>1 {for (j=2;j<=5;j++) if ($j != $(j+5)) print $1 FS $6 FS b[j] FS $j FS $(j+5)}'
This method successfully identifies the differing data items and reports their eventID, occurrenceID, fieldname and values in both event.txt and occurrence.txt. However, it's a very fiddly method and depends on careful attention to field numbers. There are probably better ways to do this with higher-level languages, such as Python.