
This version of the Cookbook and the first 100 posts in the companion blog BASHing data have been archived in Zenodo and can be downloaded for offline use. Links between the Cookbook and the blog are all local in the archived versions, so you can use both resources without needing to go online.

Introduction

This is version 2 of a cookbook that will help you check whether a data table (defined on the first data tables page) is properly structured and free from formatting errors, inconsistencies, duplicates and other data headaches.

All the data-checking and data-cleaning recipes on this website use GNU/Linux utilities and other command-line tools in a BASH shell and work on plain text files. If your data table is in a spreadsheet or a database, you first need to export the table as a text file, or copy/paste it into a text editor and save the plain text table as your working file.

Why go to all that trouble? Why not check your data table inside a spreadsheet or database? Because it's often much easier to find and fix errors in a plain text file. The command-line tools for processing plain text have up to 40 years of development behind them. The tools are fast, reliable, simple to use and easy to learn. They are particularly good at working with files too big for editing in a spreadsheet (millions of records, hundreds of fields).

Command-line recipes on this website are shown as below. This is real text, not an image, so you can copy and paste commands to try them out. Please note, however, that the < and & characters are represented on these webpages by their HTML equivalents, and may not paste properly into some GUI terminals.

This is a command

The maximum width of the text on this website is about 800 pixels, so you can keep recipes open side-by-side with a terminal on a widescreen display.


Background you need

You know how to use the command line in a BASH shell, you know at least a little about regular expressions, and you may have used AWK and sed before. You know how to save a command as a function in your .bashrc file, and you know how to put a script in your $PATH and make it executable.
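If the .bashrc step is unfamiliar, here is a minimal sketch (the function name "fields" and the demo file are invented for illustration):

```shell
# A one-line check saved as a function; a line like this could go in ~/.bashrc
# ("fields" is a made-up name, not a cookbook recipe)
fields() { head -n 1 "$1" | tr '\t' '\n' | nl; }

# Try it on a small tab-separated demo table:
printf 'id\tname\tdate\n1\tAlice\t2021-01-01\n' > demo.tsv
fields demo.tsv    # numbers the field names in the header line
```

After adding a function to .bashrc, run source ~/.bashrc (or open a new terminal) to make it available.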

If you're a bit rusty on command-line operations, here are some helpful resources:

Using the shell:   Ryan's tutorial; William Shotts' online book
Regular expressions:   Zytrax tutorial; regular-expressions.info
AWK:   tutorialspoint; "learnbyexample" on GitHub
sed:   tutorialspoint

If you run Linux on your computer, skip to Command-line programs you need, below.

If your computer is a Mac, your BASH shell is the "Terminal" app in Applications. (A nice Mac alternative is iTerm2.) Most of the commands used in this cookbook are already available in the Mac shell, but I strongly recommend that you install and use their GNU versions (e.g. with Homebrew). Key programs are coreutils, gawk, grep and gnu-sed. (See this GitHub page on how to make gnu-sed your default sed.)
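A quick way to check which sed you are running, since GNU sed answers --version while the BSD sed that ships with macOS errors out instead (a sketch; assumes a POSIX shell):

```shell
# Print which flavour of sed is first in $PATH
if sed --version 2>/dev/null | grep -q 'GNU sed'; then
    echo 'GNU sed'
else
    echo 'BSD or other sed'
fi
```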

If your computer runs Windows 10, you can access a BASH shell after installing the "Windows Subsystem for Linux". Microsoft's instructions are here. You'll get better results, though, if you dual-boot a Linux distribution and Windows, or if you run Linux in a virtual machine.


Command-line programs you need

AWK (gawk; GNU AWK 4 or higher), cut, echo, grep (with PCRE capability), head, hexdump, iconv, less, nl, paste, pr, sed, sort, tail, tr, uniq, wc, xxd.

Recommended: column, pv, YAD dialog creator
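To see at a glance whether the programs above are installed, one possible check (a sketch; extend the list to suit):

```shell
# Report any of the required programs that are missing from $PATH
for prog in gawk cut grep head iconv nl paste pr sed sort tail tr uniq wc xxd; do
    command -v "$prog" >/dev/null 2>&1 || echo "missing: $prog"
done
```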

To keep this cookbook fairly short and simple, I haven't explained how the recipes work. Some of the more complicated recipes have links to relevant BASHing data blog posts and other online resources.


Pro tips

Back up a data table before cleaning it. This makes the original data table safe from unintended edits, and allows you to do before-and-after reporting. Good practice is to keep progressive backups as you make changes to a data table, numbering each new version serially, e.g. table0 (a copy of the original), table1, table2 etc. I don't recommend using the in-place editing options for sed (-i) or AWK (gawk, -i inplace).
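The numbering scheme above might look like this in practice (the filenames and cleaning commands are invented for illustration):

```shell
# A tiny demo table with two common problems: semicolon separators and CRLF line endings.
# table0 stands in for your copy of the original.
printf 'id;name\r\n1;Alice\r\n' > table0

sed 's/;/,/g' table0 > table1     # step 1: semicolons to commas, written to a NEW file
tr -d '\r' < table1 > table2      # step 2: strip the carriage returns, again a new file
cat table2
```

Each step reads the previous numbered file and writes the next, so every intermediate version survives for before-and-after checks.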

Keep a log of every step in data auditing and cleaning. A plain text file will do. Paste into the log a copy of every command and its output (or a summary of the output), in the order in which you used the commands. If problems arise during data cleaning, the log file will help you understand what went wrong. The log file can also be the basis of a report to the data owner.
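One simple way to capture a command and its output together in a plain-text log (the filenames here are hypothetical):

```shell
printf 'b\na\nb\n' > table1              # a stand-in working file

{
    echo '$ sort table1 | uniq -d'       # record the command itself...
    sort table1 | uniq -d                # ...followed by its output (the duplicated value)
} >> audit.log

cat audit.log
```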