A data cleaner's cookbook

Introduction

This cookbook will help you check whether a data table (defined on the first data tables page) is properly structured and free from formatting errors, inconsistencies, duplicates and other data headaches.

All the data-checking and data-cleaning recipes on this website use GNU/Linux utilities and other command-line tools in a BASH shell and work on plain text files. If your data table is in a spreadsheet or a database, you first need to export the table as a text file, or copy/paste it into a text editor and save the plain text table as your working file.

Why go to all that trouble? Why not check your data table inside a spreadsheet or database? Because it's often much easier to find and fix errors in a plain text file. The command-line tools for processing plain text have up to 40 years of development behind them. The tools are fast, reliable, simple to use and easy to learn. They are particularly good at working with files too big for editing in a spreadsheet (millions of records, hundreds of fields).

Command-line recipes on this website are shown as below. This is real text, not an image, so you can copy and paste commands to try them out.

This is a command

Screenshots from my terminal look like this:

Some pages have comments, too:

This is a comment

The maximum width of the text on this website is about 800 pixels, so you can keep recipes open side-by-side with a terminal on a widescreen display.

Past versions of this Cookbook are available as downloads in Zenodo.

2024 update: A special version of this Cookbook is online for workers using the Darwin Core biodiversity standard.

Background you need

You know how to use the command line in a BASH shell, you know at least a little about regular expressions, and you may have used AWK and sed before. You know how to save a command as a function in your .bashrc file, and you know how to put a script in your $PATH and make it executable.

If you're a bit rusty on command-line operations, here are some helpful resources:

Using the shell:   Ryans tutorial; William Shotts' online book
Regular expressions:   Zytrax tutorial; regular-expressions.info
AWK:   tutorialspoint; "learnbyexample" on GitHub
sed:   tutorialspoint

If you run Linux on your computer, skip to Command-line programs you need, below.

If your computer is a Mac, your BASH shell is the "Terminal" app in Applications. Most of the commands used in this cookbook are already available in the Mac shell, but I strongly recommend that you install and use their GNU versions (e.g. with Homebrew).

If your computer runs Windows 10 or 11 you can access a BASH shell after installing the "Windows Subsystem for Linux". You may get better results if you dual-boot a Linux distribution and Windows, or if you run Linux in a virtual machine.

Command-line programs you need

Most or all of these are already included in your Linux distribution:

AWK (gawk; GNU AWK 4 or higher), cut, echo, grep (with PCRE capability), head, hexdump, iconv, less, nl, paste, pr, sed, sort, tail, tr, uniq, wc, xxd

I also recommended that you install column, pv and YAD dialog creator.

To keep this cookbook fairly short and simple I haven't explained how the recipes work. Some of the more complicated recipes have links to relevant BASHing data blog posts and other online resources.

Pro tips

Back up a data table before cleaning it.
This makes the original data table safe from unintended edits, and allows you to do before-and-after reporting. Good practice is to keep progressive backups as you make changes to a data table, numbering each new version serially, e.g. table0 (a copy of the original), table1, table2 etc. I don't recommend using the in-place editing options for sed (-i) or AWK (gawk, -i inplace).

Keep a log of every step in data auditing and cleaning.
A plain text file will do. Paste into the log a copy of every command and its output (or a summary of the output), in the order in which you used the commands. If problems arise during data cleaning, the log file will help you understand what went wrong. The log file can also be the basis of a report to the data owner.