Special topics


Hybrid occurrence tables

Best practice with occurrence data is either to include all the event data items in each occurrence record, or to separate out the event data items in an event.txt table, and replace them in occurrence.txt with a corresponding eventID entry.

This doesn't always happen. Some Darwin Core datasets have both an event.txt table and an occurrence.txt table, with one or more event fields (such as eventDate) repeated in the occurrences table.

Are the repeated data items the same in both tables for the same eventID? There isn't an easy check to answer this question, but here I'll demonstrate one such check with a pair of fictitious data tables (below). In both tables there are the event data fields decimalLatitude, decimalLongitude, eventDate and recordedBy fields. To save space I've abbreviated some field names, e.g. "decLat".

event.txt:

ideventIDeventDatedecLatdecLonrecBy
bos1bos12025-03-16-24.7273135.6194B. Tor
bos2bos22025-03-16-24.7268135.6201B. Tor
bos3bos32025-03-17-24.9113135.6128B. Tor | C. Ham
bos4bos42025-03-18-24.9106135.6121B. Tor | C. Ham
bos5bos52025-03-19-24.9097135.6099B. Tor | C. Ham

occurrence.txt:

idoccIDsciNameeventIDdecLatdecLoneventDaterecByindCount
pog001pog001Pogona vitticepsbos1-24.7273135.61942025-03-16B. Tor1
pog002pog002Varanus giganteusbos1-24.7273135.61942025-03-16B. Tor1
pog003pog003Pogona vitticepsbos2-24.7268135.62012025-03-16B. Tor2
pog004pog004Varanus giganteusbos2-24.7268135.62012025-03-16B. Tor1
pog005pog005Pogona vitticepsbos3-24.9113135.61282025-03-17B. Tor | C. Ham1
pog006pog006Varanus giganteusbos3-24.9113135.61282025-03-17B. Tor | C. Ham1
pog007pog007Suta sutabos3-24.9113135.61282025-03-17B. Tor | C. Ham3
pog008pog008Pogona vitticepsbos4-24.9106135.61212025-03-18B. Tor | C. Ham2
pog009pog009Varanus giganteusbos4-24.9016135.61212025-03-18B. Tor | C. Ham1
pog010pog010Suta sutabos4-24.9106135.61212025-03-18B. Tor | C. Ham1
pog011pog011Aspidites ramsayibos4-24.9106135.61212025-03-18B. Tor1
pog012pog012Varanus giganteusbos5-24.9097135.60992025-03-19B. Tor | C. Ham1

The first step in the check is to build a concordance that gives field numbers for the repeated fields (including eventID), first in event.txt, then in occurrence.txt. Note that I've excluded the id field (field 1), since if it's present it will presumably always be different between the two tables:

awk -F"\t" 'ARGIND==1 && FNR==1 {for (i=2;i<=NF;i++) a[$i]=i} ARGIND==2 && FNR==1 {for (j=2;j<=NF;j++) if ($j in a) print $j FS a[$j] FS j}' event.txt occurrence.txt

concordance

Step 2 is to build a tab-separated, combined events+occurences table with just the shared fields plus occurrenceID. Using AWK, this table-building has to be done using the concordance results, and since these can vary from dataset to dataset, the following command is "ad hoc" for my demonstration tables:

awk -F"\t" 'FNR==NR {a[$2]=$2 FS $4 FS $5 FS $3 FS $6; next} $4 in a {print a[$4] FS $2 FS $5 FS $6 FS $7 FS $8}' event.txt occurrence.txt

combo

The third and final step is to work through the combined table and identify data items that differ for the same field and the same eventID. The AWK command that does this is again "ad hoc" because it depends on the placement of fields in the combined table. Since this third command works on the combined table, the step 2 and 3 commands are chained together. Again, I've abbreviated to save space:

awk -F"\t" 'FNR==NR {a[$2]=$2 FS $4 FS $5 FS $3 FS $6; next} $4 in a {print a[$4] FS $2 FS $5 FS $6 FS $7 FS $8}' event.txt occurrence.txt | awk -F"\t" 'BEGIN {print "eventID\toccurID\tfield\tev.txt\toc.txt"} NR==1 {for (i=1;i<=NF;i++) b[i]=$i} NR>1 {for (j=2;j<=5;j++) if ($j != $(j+5)) print $1 FS $6 FS b[j] FS $j FS $(j+5)}'

differences

This method successfully identifies the differing data items and reports their eventID, occurrenceID, fieldname and values in both event.txt and occurrence.txt. However, it's a very fiddly method and depends on careful attention to field numbers. There are probably better ways to do this with higher-level languages, such as Python.