About this website
A Data Cleaner's Cookbook went online on 23 October 2016. I corrected and updated it frequently over the next three years. At the end of 2019 I began re-organising the site and adding new recipes and examples from the companion blog, BASHing data. The current version of the Cookbook first appeared on 13 January 2020.
If you find mistakes on this website or have suggestions for better recipes, please email me.
Robert Mesibov, West Ulverstone, Tasmania, Australia
Latest update: 2020-10-24
About the companion blog
On the BASHing data blog I write about:
- Data auditing, cleaning and processing
- Characters and encoding
- Data formatting
- Data analysis
- AWK tips and tricks
- BASH tips and tricks
- Useful programs for command-line data ops
- Data entry and display
- The Windows and spreadsheet worlds
- Miscellaneous stuff
The blog posts have more examples and more background information than the Cookbook. If you like data work, keep up with the blog through its RSS feed.
About me
I'm a data auditor and retired scientist, and I've been working with data tables for nearly 50 years. I started with printed columns on paper (and a calculator) before moving to spreadsheets and relational databases (Microsoft Access, FileMaker Pro, MySQL, SQLite).
In 2012 I discovered the AWK language and realised that every processing job I had ever done with data tables could be done faster and more simply on the command line. Since then my data tables have been stored as plain text and managed with command-line tools, especially AWK.
In case you're wondering "Which Linux?", I run MX on my desktop and my work laptop. For years I ran Debian (stable) Xfce; then the antiX and MEPIS communities put together MX: Debian (stable) Xfce nicely supplemented (but not overloaded) with handy new utilities and a solid selection of apps, and still very fast. I highly recommend MX as an all-purpose Linux distro.
Contact me directly if you would like a quote on a data auditing or data cleaning job. Here in Australia, I'm also happy to quote on training data workers (in person) in command-line methods.
About the banner image
The webpage banner shows a detail from a painting by the 17th-century Flemish artist David Rijckaert III. I like the look of concentration on the alchemist's face as he refers to a text. Working with the command line isn't alchemy, but sometimes it seems like magic.
The text and images on this website are my own work and are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are welcome to use or copy the information and images on this website for non-commercial purposes, but please attribute that use to this source.
Please note that the software commands on this website are provided "as is", without warranty of any kind, express or implied, including fitness for particular purposes. In no event shall the website author be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software commands on this website.