Introduction

Like its companion website A Data Cleaner's Cookbook, this Darwin Core table checker site will help you check whether a data table is properly structured and free from formatting errors, inconsistencies, duplicates and other data headaches.

Unlike the Cookbook, the focus here is on tables with biodiversity data following the Darwin Core standard.


Background you need

You know how to use the command line in a BASH shell, how to save aliases and functions in your .bashrc file, and how to put a script in your $PATH and make it executable. The following required programs are mainly GNU utilities and are all (or almost all) supplied with the mainstream Linux distributions:

AWK (gawk; GNU AWK 4 or higher), cut, echo, grep (with PCRE capability), head, hexdump, iconv, less, nl, paste, pr, sed, sort, tail, tr, uniq, wc, xxd
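
A quick way to see whether anything in that list is missing is sketched below. This is only an illustrative check, not one of the tools from this website: the function name missing_programs is made up for the example, and the PCRE test assumes that a grep without PCRE support will fail when given the -P option.

```shell
# Print the name of each program that cannot be found
# (hypothetical helper, named here only for illustration)
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || printf '%s\n' "$prog"
    done
}

# Prints nothing if every required program is available
missing_programs gawk cut echo grep head hexdump iconv less nl paste pr sed sort tail tr uniq wc xxd

# grep -P only works if grep was built with PCRE capability
if printf 'a1\n' | grep -qP '\d' 2>/dev/null; then
    echo "grep: PCRE available"
else
    echo "grep: PCRE not available"
fi

# Show the gawk version line, so you can confirm it is 4 or higher
if command -v gawk >/dev/null 2>&1; then
    gawk --version | head -n 1
fi
```

Any name printed by the first command is a program you still need to install.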

I also recommend that you install column, csvkit, pv and YAD dialog creator, if you don't already have them on your Linux system.

If you are running Windows I recommend dual-booting a Linux distribution such as MX Linux to work in BASH on the command line, although in 2023 Microsoft seems confident that Linux and Linux apps will work with WSL2 on Windows 11 or late Windows 10.

You will have more preparatory work to do if your computer is a Mac. The default Mac shell is zsh, which will need to be changed to BASH. The Mac utilities are BSD ones and you should install their GNU equivalents (with Homebrew, for example) and work out how to ensure that the commands you enter will launch the GNU versions. (This is not always easy — see here.)
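
One rough way to find out which version of a utility your shell will launch is to ask the utility for its version string. The sketch below assumes that GNU utilities respond to --version with a line containing "GNU" and that their BSD counterparts either reject the option or do not mention GNU. The function name is_gnu is hypothetical.

```shell
# Succeeds if the named command reports itself as a GNU program
# (hypothetical helper; relies on the --version output containing "GNU")
is_gnu() {
    "$1" --version 2>/dev/null | grep -q 'GNU'
}

# Report on a few of the utilities used on this website
for prog in sed sort grep gawk; do
    if is_gnu "$prog"; then
        echo "$prog: GNU version"
    else
        echo "$prog: not GNU, or not found"
    fi
done
```

If a utility is reported as "not GNU" after you have installed the GNU version, the BSD version is probably still being found first in your $PATH.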

Please note: Included in some of the functions on this website are these two very useful aliases:

alias uniqc="uniq -c | sed 's/^[ ]*//;s/ /\t/'"
alias barsep="sed 's/\t/ | /g'"

The uniq -c command (from the GNU "coreutils" package) counts the distinct items in a list, then puts the right-justified count one space to the left of the item. Because uniq only merges adjacent identical lines, the list is normally sorted first. For many purposes I've found it better to have a left-justified count separated from the data item by a tab, and I get that with uniqc.

barsep replaces all the tab characters in a TSV with [space][vertical bar][space]. This makes the records easier to read in a terminal.

To avoid getting an error message when using functions with uniqc and/or barsep, add these aliases to your .bashrc file.
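
To see what the two aliases do, here is a small demonstration on made-up data. It is only a sketch: uniqc and barsep are written here as shell functions with the same bodies as the aliases, because BASH does not expand aliases in non-interactive scripts by default, and it assumes GNU sed (which understands \t).

```shell
# Function equivalents of the two aliases, for use in a script
uniqc() { uniq -c | sed 's/^[ ]*//;s/ /\t/'; }
barsep() { sed 's/\t/ | /g'; }

# Invented sample: one class name per line.
# Prints two lines: 3[tab]Aves and 1[tab]Insecta
printf 'Aves\nInsecta\nAves\nAves\n' | sort | uniqc

# Invented three-field TSV record.
# Prints: occ1 | Aves | 2020-01-05
printf 'occ1\tAves\t2020-01-05\n' | barsep
```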


Topical outline of this website

Getting to TSV explains how to ensure your Darwin Core data table has tab-separated fields.

Fields deals with fieldnames and field-numbering tools.

Encoding & characters details how to check that your table is UTF-8 encoded and free of mojibake, invisible control and formatting characters, and non-simple character versions.

Structure looks for structural problems in Darwin Core tables.

Within fields explains how to check individual fields in a table.

Between fields covers inconsistencies between related fields.

Between records looks at four kinds of duplicates.

Between tables describes consistency checks, for example between event.txt and occurrence.txt.

Taxa, places and dates covers various issues involving scientific names, location data and dates.

Special topics covers data-checking matters that don't easily fit under any of the topics above.

Please note that although some publicly available Darwin Core data tables are shown on this website to have errors, in all cases these errors were subsequently corrected by the data compilers!


How to fix data problems?

Text additions, deletions and replacements are most easily and safely done in a good text editor. I recommend Geany for Linux, Mac and Windows, and Notepad++ for Windows.

Spreadsheets like Microsoft Excel are not safe places to do text editing, unless you are a skilled and very careful spreadsheet user. Safer alternatives are table editors, like the excellent Modern CSV for Windows, Mac and Linux, and EmEditor for Windows. Table editors look and act like spreadsheets but avoid spreadsheet hazards (and don't have formulas).

Additions, deletions and replacements can also be done on the command line, and I give some examples on this website.
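
As one hedged illustration of a command-line replacement (not necessarily the approach used elsewhere on this website), the sketch below replaces a value in a single field of a TSV and leaves the other fields untouched. The function name fix_field and the sample data are invented for the example.

```shell
# Replace an exact old value with a new value in one field of a TSV
# (hypothetical helper). Usage: fix_field FIELD_NUMBER OLD NEW < in.tsv > out.tsv
# Note: awk interprets backslash escapes in values passed with -v.
fix_field() {
    awk -F'\t' -v OFS='\t' -v f="$1" -v old="$2" -v new="$3" '
        # Appending "" forces a string comparison, so that 1.0 is not treated as equal to 1
        ($f "") == (old "") { $f = new }
        { print }
    '
}

# Invented sample table with a misspelled country name in field 2
printf 'id\tcountry\n1\tAustralia\n2\tAustrailia\n' | fix_field 2 "Austrailia" "Australia"
```

Writing the result to a new file, rather than overwriting the original table, makes it easy to compare the two versions before accepting the change.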


Questions, corrections or suggestions?

Email me: robert.mesibov@gmail.com. This website will be periodically updated and your input is welcome.

Please note: this entire website can be downloaded from Zenodo for offline use.


Legal matters

The text and images on this website are my own work and are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are welcome to use or copy the information and images on this website for non-commercial purposes, but please attribute that use to this source.

Please note that the software commands on this website are provided "as is", without warranty of any kind, express or implied, including fitness for particular purposes. In no event shall the website author be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software commands on this website.


Dr Robert Mesibov
West Ulverstone, Tasmania
ORCID
CV and publications
Updated 18 February 2024