Many of the recipes here rely on AWK, which is an interpreted programming language first developed in the 1970s by Alfred Aho, Peter Weinberger and Brian Kernighan. AWK is designed to process text files consisting of records broken up into fields — such as data tables. The version included in many current Linux distributions is 'gawk 4' (e.g. GNU AWK 4.1.1, 2014), and that's the AWK version used in this cookbook.
AWK is powerful and elegant. It's powerful because it can do so many different processing jobs, and it's elegant because you don't need to write much to tell AWK how to do its job.
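As a small illustration of how little you need to write, here's a sketch that prints the second field of every record whose third field is greater than 10. The tab-separated table and the filename 'demo.tsv' are made up for this example and aren't part of the cookbook's data.

```shell
# Build a small tab-separated table (hypothetical data, for illustration only)
printf 'id\tname\tcount\n1\tapple\t5\n2\tpear\t12\n3\tplum\t30\n' > demo.tsv

# -F'\t' makes the tab character the field separator
# NR > 1 skips the header line; $3 > 10 tests the third field
# print $2 outputs the second field of each matching record
awk -F'\t' 'NR > 1 && $3 > 10 {print $2}' demo.tsv
```

With this table the command prints 'pear' and 'plum', each on its own line.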
There are many introductions to AWK arrays on the Web, but none I've seen are both comprehensive and elementary enough for an AWK beginner. One of my own efforts is here. Clear explanations of GNU AWK 4 arrays are in a manual written by the chief GNU AWK 4 developer, available as a website, a free PDF and a non-free printed book. Another good way to learn AWK arrays is to see how they're used to solve specific problems on websites like Stack Overflow.
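For readers who haven't met AWK arrays yet, here's a minimal sketch of the most common use: tallying how often each value appears in a field. The file 'collectors.tsv' and its contents are invented for this example, and 'count' and 'k' are just names I've picked for the array and the loop variable.

```shell
# Hypothetical two-column table: specimen code and collector name
printf 'a1\tSmith\na2\tJones\na3\tSmith\na4\tLee\na5\tSmith\n' > collectors.tsv

# count[$2]++ uses the second field as the array index and adds 1 each time it's seen
# The END block runs after the last line and loops over every index in the array
# AWK doesn't guarantee the order of a "for (k in array)" loop, so the output is sorted:
# largest tally first, then alphabetically by name
awk -F'\t' '{count[$2]++} END {for (k in count) print count[k] "\t" k}' collectors.tsv \
    | sort -k1,1nr -k2
```

Here the result is 'Smith' with a tally of 3, followed by 'Jones' and 'Lee' with 1 each.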
GNU sed (version 4.2.2 is used here) is a command-line text editor that processes a file line by line, like AWK. This cookbook mainly uses sed for data-cleaning recipes.
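To give a feel for line-by-line cleaning, here's a sketch that removes trailing spaces and squeezes runs of spaces down to a single space. The file 'messy.txt' is a made-up example, and these two substitutions are only one possible way to do the job.

```shell
# A small file with doubled and trailing spaces (hypothetical data)
printf 'alpha  beta   \ngamma    delta\n' > messy.txt

# Each -e expression is applied in turn to every line:
#   s/ *$//     deletes any spaces at the end of the line
#   s/  */ /g   replaces each run of one or more spaces with a single space
sed -e 's/ *$//' -e 's/  */ /g' messy.txt
```

The output is 'alpha beta' and 'gamma delta', and 'messy.txt' itself is unchanged because the result goes to the terminal.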
A few online resources for sed are good for beginners. The first three are by Dan Robbins:
Sed by example, Part 1
Sed by example, Part 2
Sed by example, Part 3
The SED FAQ
Unix - Regular Expressions with SED
Almost uniquely among command-line tools, sed has an in-place option, '-i'. With '-i', sed writes its changes back to the file it's processing, instead of leaving that file untouched and sending the result to a new file. Here I replace all instances of 'aaa' in file1 with 'bbb', first by generating a new file, then by editing in place:
$ sed 's/aaa/bbb/g' file1 > new_file1
$ sed -i 's/aaa/bbb/g' file1
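The '-i' option can be dangerous if you haven't followed data-cleaning rule no. 1! However, you can create a backup of a file at the same time that you modify it, by following the '-i' directly (no space) with a suffix. sed saves the original under the old filename plus that suffix, so the command below edits file2 and keeps the unedited version as file2_old: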
$ sed -i_old 's/aaa/bbb/g' file2
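It's worth checking the backup for yourself before trusting it. The sketch below builds a throwaway 'file2' (its contents are invented for the example), runs the same command, then displays both the edited file and the backup.

```shell
# Create a throwaway test file (hypothetical contents)
printf 'aaa 1\nxaaay 2\n' > file2

# Edit in place, keeping the original as file2_old
sed -i_old 's/aaa/bbb/g' file2

# The edited file now contains 'bbb' wherever 'aaa' was
cat file2

# The backup still contains the original 'aaa' lines
cat file2_old
```

If the backup isn't what you expect, you can restore the original with 'mv file2_old file2'.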
Regular expressions, or 'regex', are sometimes complicated and confusing, but there are some excellent explainers online. The best and most complete I've seen is by Jan Goyvaerts:
and there's a clever try-it-yourself regex builder at