

A short rant about Python, R and UNIX

In 2020 you can't google "data science", "data cleansing", "data wrangling" or almost any other search term beginning with "data" without facing a wall of links to Python and R webpages.

Python and R are excellent frameworks for data analysis and visualisation. The "data" in that last sentence, though, is clean and tidy data. That's not what Pythonistas and "useRs" start with. Like everyone else who works with data, they have to spend time — often a lot of time — getting their data into a state fit for use.

Both Python and R have been extended, clumsily, to handle the janitorial work needed to clean and tidy data before it can be analysed or migrated. The data is typically plain text, structured in a rectangular array, one record per line, with the data items in each record separated into fields.
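
To make the later examples concrete, here's a tiny, made-up file of exactly that kind. The filename (demo.tsv), field names and values are all invented for illustration; the shell one-liners below just build it.

    # Build a small demonstration file, demo.tsv: a header line plus two
    # records, each with three tab-separated fields
    printf 'name\tcountry\tyear\n' > demo.tsv
    printf 'Acacia dealbata\tAustralia\t1997\n' >> demo.tsv
    printf 'Ulex europaeus\tNew Zealand\t2003\n' >> demo.tsv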

Decades before the advent of Python 2 (2000) or R (stable beta in 2000), UNIX had fast, reliable programs for handling just this data structure: cut, grep, head, less, nl, paste, sed, sort, tail, tr, uniq and wc. (These are all now part of the GNU/Linux toolkit.)
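
As a rough sketch of the sort of job these tools do, here's a pipeline run on the made-up demo.tsv built above, tallying the values in its second field:

    # Skip the header line, cut out the second (country) field,
    # sort it, then count each distinct value
    tail -n +2 demo.tsv | cut -f2 | sort | uniq -c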

In fact, in the late 1970s a stripped-down programming language called AWK was developed specifically for operations on plain text data structured as records broken into fields. AWK has been expanded and optimised ever since. It can do a lot more today than it could 40+ years ago, but its syntax is still simple and flexible.
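
Here's a sketch of the same kind of field-by-field processing in AWK, again on the invented demo.tsv:

    # With the field separator set to a tab, print the year (field 3) and
    # name (field 1) of every record whose second field is "Australia"
    awk -F'\t' '$2 == "Australia" {print $3, $1}' demo.tsv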

Notice how the UNIX-era tools have short, simple names? That's how you invoke them, too. Just type head -20 and a filename in a terminal and you get that file's first 20 lines printed to the screen.
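
For example, with the made-up file built above:

    head -20 demo.tsv    # prints the first 20 lines (here, all 3) to the screen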

No importing modules or libraries, either. The UNIX tools are already on your computer (Linux, Mac) or at least available (Windows Subsystem for Linux). No preparation is necessary: you open a terminal and start working with your data.

This rant was inspired, in part, by reading a tutorial about R's pipe operator, %>%, which allows you to chain together two operations. You can take the output of the first operation and make it the input of the second operation by putting %>% between them. The R pipe first appeared in 2013, exactly 40 years after the pipe | was introduced to do just that job in UNIX, with two fewer characters to type.
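
For comparison, here's that kind of chaining with the UNIX pipe, once more on the invented demo.tsv:

    # grep's output becomes wc's input: count the records mentioning Australia
    grep "Australia" demo.tsv | wc -l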

It's puzzling. Is there some sort of fanatic loyalty to Python and R that makes their users struggle with data operations that AWK and UNIX tools could do more simply? Or is it that these users suffer from Windows and can run Python in a Windows shell and R in a Windows IDE, but until recently had no way of running GNU/Linux programs within Windows on their computers?

In any case, Pythonistas and useRs could save themselves a lot of work and do more with their raw data if they started using those UNIX-era tools and AWK. Here are some online resources for beginners:

UNIX and GNU/Linux tools for text processing
learnbyexample
Text processing in the shell
Brad Yoes' tutorial
Wikipedia list of links to Wikipedia articles on individual commands
...and many, many other resources best found by searching for the tool and examples

AWK
tutorialspoint
learnbyexample
Daniel Robbins' AWK tutorials: part 1, part 2, part 3
Bruce Barnett's AWK tutorial
Patrick Hartigan's examples
Idiomatic AWK

Data cleaning
A Data Cleaner's Cookbook


Last update: 2020-10-28
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.