About this blog

BASHing data is a companion blog to A Data Cleaner's Cookbook. The blog is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing etc.

This page introduces the first BASHing data series, which includes 200 posts and ran from 2018 to 2022. Like A Data Cleaner's Cookbook, the first series is archived in Zenodo and can be downloaded for offline use. The second series of BASHing data began in 2024 and is a separate website.

About me

I'm a data auditor and retired zoologist.

Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com

The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Posts by category (most recent post first):

Data auditing, cleaning and processing

People are the best data cleaners (2022-04-08)
     Between spreadsheets and Big Data analytics is the command line

Search for (exact) strings; report line, column and context (2022-03-09)
     A coloured grep for data tables

Detecting truncations: another sometimes successful method (2021-12-15)
     This is a difficult job and every command-line trick helps

A quick cross-file comparison with AWK (2021-11-10)
     AWK neatly does a tricky data comparison

Duplicate records differing only in unique identifiers - updated (2021-10-27)
     A much-improved method for finding these partial duplicates

Some regex tests with grep, sed and AWK (2021-10-20)
     Speed tests for different search/filter cases

How to do replacements based on multiple field values (2021-10-06)
     Command-line repairs with a powerfully simple tool

There's data missing - please explain (2021-06-30)
     A blank entry can have hidden meanings

The curious world of check digits (2021-06-16)
     How they work, and code to validate an ABN

The Incrementing Fill-Down Error (2021-05-26)
     Another data crime with spreadsheeting as the prime suspect

A data checker's checklist (2021-05-12)
     A draft outline of topics for the next online resource

How to fix "one2many" data issues (2021-03-17)
     Command-line repairs for a surprisingly common type of error

DIY primary/foreign key relationships, again — updated (2021-03-25)
     A script to check for primary/foreign key issues

Four kinds of data anomalies (2021-02-24)
     Anomalies might be out of range, out of place, out of match or out of date

How to find the missing parts of a series (2021-02-03)
     Command-line solutions for a simple and two more complicated cases

How to build a multi-file fields concordance (2020-12-23)
     Clearly show which fields have the same name in two or more files

Check the day of year, given a date (2020-11-18)
     Comparing ISO 8601 dates with their day numbers

How to keep an eye on field numbers (2020-11-04)
     Put the field numbers on a digital Post-it note with YAD

Three kangaroos in the ocean (2020-09-30)
     Ridiculous outliers can sometimes be worth salvaging

Finding one-to-many entries in a data table (updated) (2020-09-16)
     Too many B's for each A?

Checking DIY primary/foreign key relationships (2020-09-02)
     Problems when primary and foreign keys are hand-built

How to do a both/neither/one/other tally - updated (2020-09-06)
     A simple check on paired fields (like latitude and longitude) in a data table

How to find almost-duplicates (2020-07-01)
     Two methods that work with some (but not all) data tables

Add an issues field to a data table (2020-05-20)
     How to get records to self-report their problems

Spellchecking scientific names on the command line (2020-05-06)
     How to build and use a dictionary of scientific names

Targeted string replacements with sed and AWK (2020-04-08)
     Avoid the dangers in globally replacing A with B

A curious pair of data ops (2020-03-18)
     Multiple pivots and keying the unreadable

Moving averages with AWK (2020-03-04)
     A command for adding moving averages to a table

Topping and tailing, and the slowness of GNU sort — updated (2019-11-08)
     GNU sort can be a rate-limiting step in a pipeline

How to guess the field separator in a table (2019-10-04)
     Count up the likely field separators in the header line with AWK

Long, narrow tables vs short, wide ones (2019-08-16)
     Three tests of processing speed show that table shape doesn't matter

A bulk replacement GUI with YAD (2019-08-02)
     A shell script for "normalising" pseudo-duplicates in a data table

Finding malformed markup (2019-07-19)
     How to identify messed-up HTML tags in non-HTML documents

Leading and trailing whitespace (2019-06-28)
     How to find and delete "fore and aft" whitespace within fields in a data table

Growing the Cookbook's "broken" function (2019-05-31)
     A more informative way to tally up the number of fields in a data table

How to delete, insert and replace whole lines (2019-05-12)
     Use line addresses to target just the right lines

How to delete, insert and replace whole fields (2019-05-05)
     Cut and paste are usually the right tools for these jobs

Comparing fields across two tables (2019-02-03; updated)
     A script to check for changes in a field

How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons? (2018-12-16)
     On the command line, you can ignore everything but the numbers

Finding changepoints in a list, revisited (2018-12-06)
     Using AWK to find where values change in a list

How to find distances between lat/lons for geochecking (2018-11-07)
     When you're looking for big differences, an approximate method is fine

Bird watching with AWK and grep (2018-10-24)
     Showing off the fastest way to search a text file for strings in another file

How to validate ISO 8601 dates without regex (2018-10-05)
     Check for format and content errors in YYYY-MM-DD fields with AWK

Fightin' fields (2018-09-30)
     Finding disagreements between data fields can be challenging

Fuzzy matching in practice (2018-09-23)
     Tips for approximate matching with tre-agrep

48 sea levels and a trope for your terminal (2018-08-11)
     A bulk string replacement with AWK, and that ACCESS DENIED thing

Pseudo-blank ("empty") records and fields (2018-08-04)
     How to find not-quite-empty rows and columns in a data table

Time series ops (2018-07-23)
     Using AWK to summarise time series data

Partial duplicates (2018-07-14)
     One way to find "pseudoduplicated" records

Truncated data items (2018-07-04)
     Detecting truncations, such as a 100-character string clipped to 50 characters

Compare parts of strings (2018-05-22)
     How to use AWK's "split" function to compare parts of strings


Characters and encoding

Gremlin detection bigly improved and a NUL problem avoided (2021-12-08)
     The gremlin detector script has been rebuilt from scratch

How to watermark a UTF-8 plain text file (2021-11-24)
     Use an inconspicuous Unicode character and a placement code

How to find mixed Latin+Cyrillic words (2021-09-29)
     Blue Latin and red Cyrillic letters in words containing both

Show Unicode code points for UTF-8 characters (2021-09-15)
     Convert a character to its code point (\uxxxx) with shell tools

Yet another gremlin: the zero-width space (2021-09-01)
     How to find and kill it before it does mischief

What is +ACY- doing in the data? (2021-07-14)
     A strange encounter with UTF-7

Mojibake madness (2021-05-19)
     Spectacular examples of gibberish from recent data audits

Mojibake bonanza (2020-12-16)
     New mojibake origin puzzles from a museum database

Encoding detection smackdown (2020-09-23)
     enca vs file vs iconv vs isutf8 vs uchardet

Character equivalence classes 2: the nature of equivalence (2020-06-24)
     What does "something like" actually mean?

Character equivalence classes 1: search and replace (2020-06-17)
     How to find "something like" a character

More mojibake fun (2020-04-01)
     Easy-to-hard examples of translating from gibberish

Hunting gremlins (2020-01-22)
     A script to make invisible gremlin characters visible

Build your own character class inventories — updated (2019-12-27)
     Find out what [:alpha:] and [:cntrl:] mean in your system

Introducing the replo (2019-11-01)
     Character replacements by computers can be reversible, reconstructable or researchable

An unexpected character replacement (2019-10-18)
     Strange replacements of non-ASCII characters by R

Return of the mojibake detective (2019-07-05)
     Three new cases of mysterious character corruptions

Quotes as characters (2019-04-07; updated 2019-05-26)
     How to recognise the nine different kinds of single and double quotes

How to choose special characters, revisited (2019-03-24)
     Scripting a little GUI for copying/pasting your most often-used special characters

iconv and illegal input sequences (2018-09-13)
     Getting around a roadblock in changing the character encoding of a file

SCI and 62;c62;c62;c... (2018-08-25)
     A control character causes strange behaviour in GUI terminals

Mojibake detective work (2018-08-06)
     A close look at some character encoding problems

Question marks that aren't really question marks (2018-07-27)
     Some question marks show that a program doesn't understand a character's encoding

Combo characters (2018-06-09)
     How to deal with Unicode's combining characters


Data formatting

gron the JSON flattener (2022-03-23)
     Flattened JSON can be worked with shell tools

How to flatten ("unpivot") a data table (2022-03-16)
     Make a table into a list of values by row and column

Auto-incrementing version letters (2022-03-02)
     Solutions for building 101c, 101d, 102a, 102b...

A dog-cat-horse-turtle problem (2022-01-19)
     Seven solutions and counting for this one problem

Tidy tables for data processing (2022-01-12)
     Preparing data for programs that don't care about "pretty"

Building an ODT on the command line (2021-12-29)
     Bypass the GUI by starting with HTML

Making a transect into a point and circle (2021-12-22)
     Convert a WKT linestring to point-radius in metres

What's wrong with my footprintWKT? (2021-11-17)
     About WKT and unexplained "invalid" results in GBIF

On visual contrast and QR codes (2021-11-03)
     Boosting contrast makes blurry QR codes readable

TSV to CSV on the CLI (if you really have to) (2021-10-13)
     How to build an RFC4180-compliant CSV from a TSV

zbarimg and blurry QR codes (2021-08-25)
     Surprisingly well-blurred codes are still readable

Two data formatting tweaks (updated) (2021-08-11)
     Handy ways to make tab-separated fields more obvious

Reverse or shuffle a string in a particular field (2021-07-07)
     Shell tools or AWK can do this, or a mix

"Firstname Lastname" to "Lastname, Firstname", with complications (2021-06-23)
     Name parsing and formatting is rarely simple

CSV to table, table to CSV (2021-06-02)
     How to pivot and "de-pivot" a CSV table

Converting a list to a presence/absence table (2021-02-10)
     Re-formatting is easy with tidy, well-structured data

ASCII score bars and a gorblimey command (2021-01-27)
     How to build a string of characters and their complement

Form text and placeholders (2021-01-13)
     Form letters, diaries and mail merge in plain text

Comparing strings more clearly (2020-12-09)
     How to make and emphasise a string comparison between fields

Re-format blah,YYYYMMDD,blah as blah,YYYY,MM,DD,blah (2020-12-02)
     How to do it with sed or AWK: 7 methods

How to stack columns (2020-11-25)
     Turn a "columnated" table into a straight up-and-down one

Building a data table from a sentence (2020-10-07)
     How to expand a condensed data structure

Spotting spaces, and AWK's view of emptiness (2020-09-09)
     A simple way to show and count plain whitespaces,
     and "non-empty" vs "non-empty and non-zero" in AWK

How to number copy/pasted commands (2020-08-05)
     A neat way to number and indent commands and their outputs

Sharing data and metadata together (2020-07-29)
     How not to lose a data table's metadata

A quick repair job on a dislocated table (2020-07-15)
     Fixing a table with displaced fields

Extra commas in a CSV (2020-07-08)
     How to safely delete just the excess commas

Join consecutive lines if condition applies (2020-06-03)
     Simple ways to fix embedded newlines

Printing repeats within repeats, and splitting a list into columns (2020-05-27)
     Why I use pr rather than column for some columnating jobs

How to move selected lines within a file (2020-05-13)
     No need to cut and paste, use the command line

Dealing with an all-CAPS/first-CAP jumble (2020-04-29)
     How to normalise a mix of WORDS and Words

How to be uncertain with dates (2020-02-12)
     A skeptical look at some of ISO 8601's new extensions

JSON Lines: record-style JSON (2020-01-29)
     A bridge between table-style data and standard JSON

Emphasising text in the terminal (2019-12-13)
     Making selected strings stand out with ANSI codes

Embedded newlines without a clue (2019-11-15)
     Without clear markers for field fragments, you need to be creative

Add leading zeroes that aren't really leading (2019-09-13)
     How to format numbers when they're inside non-numeric strings

A GUI to re-order fields in a table (2019-08-30)
     A shell script for building a new table with reordered fields

The lat/lon floating point delusion (2019-08-09)
     That big building is at -33.8903169365705 151.198409720645? Really?

Renumber a list after inserting a line — updated (2019-07-27)
     A handy function for inserting and renumbering

Data from dingbats: copying down (2019-02-24)
     Copying down is easy in a spreadsheet, but it's also possible on the command line

Fancy numbering of records (2019-02-17)
     On the command line, you can number a list of records any way you like

Reformatting a list, cleverly (2019-01-27)
     Create horizontal lists from a vertical one

Horizontal sorting within a field (2019-01-13)
     There are two different ways to sort a field "horizontally", but neither of them is simple

Changing the month format: a fairly general solution (2018-12-30)
     Build a look-up table and use the starting and finishing format in an AWK command

Putting information into a table from the table's filename (2018-12-13)
     The example adds a date from the filename to each record in the table

Unwrap your fasta (2018-12-01)
     How to concatenate the sequence lines in FASTA files

Repair job: separate the tandem repeats (2018-10-26)
     How to split a tandem repeat between fields

Too many lat/lon digits (2018-06-30)
     Rounding off latitude/longitude data to an appropriate number of significant figures

Embedded newlines (2018-06-23)
     How to safely remove embedded newlines


Data analysis examples

Online shopping and a one2many tweak (2022-02-23)
     How to group product purchases by customer

Are you 10000 days old yet? (2022-01-05)
     Three command-line ways to find out

Batch triangulation on the command line (2021-06-09)
     Locate a point given the distances to two other, located points

Hunting Excel date twins (2021-03-09)
     Microsoft's choice of starting dates leads to duplicate records

The myth of equinoctial gales (2020-10-14)
     Real-world wind data don't show equinoctial gales

What's wrong with these records? (2020-08-26)
     Tinkering with "present in these records, absent in those"

Checking date components across fields (2020-04-15)
     Does "date" agree with "year", "month" and "day"?

Life tables (2020-03-11)
     A sober look at the probability of dying in Australia

Data quality in iNaturalist downloads (2020-02-05)
     Top marks for data from the citizen-science iNaturalist project

Steady as she goes, in Darwin (2019-10-25)
     The daily temperatures in Darwin (Australia) are remarkably constant

Two ugly CSVs (2019-04-28)
     Open but messy data from the Australian Electoral Commission and Companies House

Dog and cat data (2019-03-31)
     A command-line exploration of five public datasets

Data with bulges (2019-03-10)
     Three cases of unexpectedly large values in a data item

Two special data validations (2019-03-03)
     Is that tree correctly located? Is that list of names and addresses truly regular?

Drugs on the command line (2019-01-06)
     A disappointing dive into drugs data from the US Food and Drug Administration

Has the rainfall pattern in my hometown changed? (2018-12-23)
     No obvious trends in number, length or intensity of rainfall events in recent years

Fun with BOM data (2018-07-11)
     Weather watching with wget and gnuplot

Pivoting airlines (2018-06-03)
     Using arrays of arrays to build a pivot table with AWK


AWK tips and tricks

How to use patsplit (GNU AWK) (2022-02-02)
     Another way to split a string with AWK

Combinations from 2 lists: speed trials (2021-12-01)
     Comparing two ways to build Cartesian products

Building a molar mass calculator (2021-03-24)
     A shell script with AWK doing the chemical formula parsing

Updating a file from a lookup table (2020-11-11)
     How to use an AWK array for lookup operations

How to use flags in AWK (revisited) (2020-10-21)
     Flags are handy for defining AWK's working range of records

The easy-going syntax of AWK commands (2020-02-26)
     AWK is flexible and tolerant in its command rules

Another surprising AWK trick (2019-12-06)
     Strings or numbers? It depends on what you're doing with them.

A muggle's guide to AWK arrays: 4 (2019-09-20)
     Easier and more flexible ways to sort array outputs

A muggle's guide to AWK arrays: 3 (2019-08-23)
     Reformatting and table joining using arrays

A muggle's guide to AWK arrays: 2 (2019-07-12)
     Working with two files, or the same file twice

A muggle's guide to AWK arrays: 1 (2019-06-07)
     Array naming, index strings and value strings

A surprising AWK trick (2018-05-27)
     A clever way to avoid using a flag in AWK


BASH tips and tricks

Put an editable command at the next prompt (2021-09-08)
     Two ways to send an unfinished command to a prompt

How to bookmark directories in the shell (2020-06-10)
     A couple of functions is all it takes

Brace expansion with variables and arrays: eval to the rescue (2020-04-22)
     eval, a BASH built-in, solves brace expansion problems

Getting around a subshell problem (2020-01-15)
     Something strange happens with buffering in a subshell

Working around the BASH brace expansion rule (2019-06-14)
     How to build Cartesian string products in BASH

The magic of BASH string expansion (2019-05-19)
     A simple trick that allows AWK and sed to use BASH as an interpreter

Avoiding senior moments with command-line functions (2018-11-13)
     The trick is to make the documentation available on the CLI


Useful programs for command-line data ops

Revisiting a command-line translator (2021-07-28)
     A handy tweak for the translate-shell program

VisiData: a table explorer for the terminal (2019-10-11)
     Display, sort, reformat and more with this CLI utility

Transpose, pivot and bin with GNU Datamash 1.4 (2019-05-24)
     Do complex data transformations more easily with Datamash

Parsing scientific names (gnparser) (2019-01-20)
     Scientific names are much harder to parse than personal names


Data entry and display

Mapping with gnuplot, part 3 (2022-04-06)
     How to create and animate "layers" on a gnuplot basemap

Mapping with gnuplot, part 2 (2022-03-30)
     How to build a good-quality, fixed-scale basemap with gnuplot

An AWK histogram with scaling (2021-09-22)
     In these histograms, bar length is scaled to the longest bar length

CSV viewers for CSV haters (2021-08-18)
     Two CLI tools and one GUI

Visualising data as a PGM image (2021-08-04)
     A not-very-successful experiment

A sunset surprise (2021-02-17)
     Data graphics help to explain a puzzling phenomenon

Changing TTY prompt, font and colors (2020-02-19)
     How to prettify your virtual terminals

Data validation on entry with YAD (2019-11-29)
     In praise of lookup lists for data entry, with help from YAD dialogs

Plotting data in the terminal with gnuplot (2019-06-21)
     A separate graphic is much better than an in-terminal plot

Making pictures with data (2019-04-14)
     How to display data bytes as image bytes

Mapping with gnuplot (2018-10-31)
     How to use gnuplot to put data points on a basemap

How to enter nothing in a database (2018-10-18)
     If you have nothing to say, say nothing

Displaying data from table fragments (2018-09-06)
     One way to build a tidy table from a jumble of data

A record pager built with YAD (2018-08-18; updated 2018-09-09)
     How to turn a YAD dialog into a GUI viewer/pager for records in a data table

GUI ways to view and edit big text files (2018-07-31)
     glogg, gvim, Geany and csvpad, but not spreadsheets

YAD repeat and edit (2018-05-21)
     How to avoid re-entering data in a YAD data entry form


The Windows and spreadsheet worlds

Apple + Microsoft = character confusion (2022-02-09)
     Saving a .docx to plain text can fail in odd ways

Spreadsheet annoyance no. 3: quotes have priority (2021-01-20)
     Beware of unmatched quotes in data items

A grizzle about captive data (2020-07-22)
     Don't confuse data with the Windows software that contains it

Spreadsheet annoyance no. 2 (2019-04-21)
     Spreadsheets make dates out of entries that aren't dates, but that's not all they mess up

The trouble with Windows CRLF (2019-03-17)
     Windows line endings are a pain in the ... terminal

Getting data out of Excel safely (2019-02-10)
     Watch out for embedded linebreaks, comma problems and character encoding issues

Curse of the CSV monster (2018-07-18)
     CSV with broken records to TSV


Miscellaneous stuff

DNA-style frameshift cryptography (2022-02-16)
     Secret messaging inspired by biology

Scripting a temperature notifier (2022-01-26)
     How cold did it get last night, and how cold is it now?

The data worker's guide to psiphiorrhea (2021-07-21)
     Too many decimal places? There's a name for that

The little museum and its data (2021-06-01)
     Love affairs between science and IT don't always end well

A short rant about Python, R and UNIX (2020-10-28)
     Why would you clean data with Python or R?

A data table thousands of years old (2020-08-12)
     Modern record-keeping in ancient Mesopotamia

Second Tuesday of each month and a BASHing data century (2020-03-25)
     ncal and the 100th blog post

Msot popele can undreatnsd tihs setennce (2019-12-20)
     Garbling and ungarbling with shell scripts

Python and shell tools (2019-11-22)
     A comparison of three data operations

A command-line "Countdown" (UK) companion (2019-09-27)
     Fast solver for anagram puzzles, and a puzzle generator

Getting data from an Enphase Envoy S (2019-09-06)
     Two user-accessible JSON files with performance data

Data on clay (2018-09-20)
     Cheap data storage for thousands of years? Check.
     Ancient glyphs in your terminal? Check.