About this blog

BASHing data was a companion blog to A Data Cleaner's Cookbook. The blog was a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing etc.

All 200 BASHing data posts (2018-2022) and version 3 of A Data Cleaner's Cookbook have been archived in Zenodo and can be downloaded for offline use. Links between the blog and the Cookbook are all local in the archived versions, so you can use both resources without needing to go online.

About me

I'm a data auditor and retired zoologist.

Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com

The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Posts by category (most recent post first):

Data auditing, cleaning and processing

People are the best data cleaners (2022-04-08)
     Between spreadsheets and Big Data analytics is the command line

Search for (exact) strings; report line, column and context (2022-03-09)
     A coloured grep for data tables

Detecting truncations: another sometimes successful method (2021-12-15)
     This is a difficult job and every command-line trick helps

A quick cross-file comparison with AWK (2021-11-10)
     AWK neatly does a tricky data comparison

Duplicate records differing only in unique identifiers - updated (2021-10-27)
     A much-improved method for finding these partial duplicates

Some regex tests with grep, sed and AWK (2021-10-20)
     Speed tests for different search/filter cases

How to do replacements based on multiple field values (2021-10-06)
     Command-line repairs with a powerfully simple tool

There's data missing - please explain (2021-06-30)
     A blank entry can have hidden meanings

The curious world of check digits (2021-06-16)
     How they work, and code to validate an ABN

The Incrementing Fill-Down Error (2021-05-26)
     Another data crime with spreadsheeting as the prime suspect

A data checker's checklist (2021-05-12)
     A draft outline of topics for the next online resource

How to fix "one2many" data issues (2021-03-17)
     Command-line repairs for a surprisingly common type of error

DIY primary/foreign key relationships, again — updated (2021-03-25)
     A script to check for primary/foreign key issues

Four kinds of data anomalies (2021-02-24)
     Anomalies might be out of range, out of place, out of match or out of date

How to find the missing parts of a series (2021-02-03)
     Command-line solutions for a simple and two more complicated cases

How to build a multi-file fields concordance (2020-12-23)
     Clearly show which fields have the same name in two or more files

Check the day of year, given a date (2020-11-18)
     Comparing ISO 8601 dates with their day numbers

How to keep an eye on field numbers (2020-11-04)
     Put the field numbers on a digital Post-it note with YAD

Three kangaroos in the ocean (2020-09-30)
     Ridiculous outliers can sometimes be worth salvaging

Finding one-to-many entries in a data table (updated) (2020-09-16)
     Too many B's for each A?

Checking DIY primary/foreign key relationships (2020-09-02)
     Problems when primary and foreign keys are hand-built

How to do a both/neither/one/other tally - updated (2020-09-06)
     A simple check on paired fields (like latitude and longitude) in a data table

How to find almost-duplicates (2020-07-01)
     Two methods that work with some (but not all) data tables

Add an issues field to a data table (2020-05-20)
     How to get records to self-report their problems

Spellchecking scientific names on the command line (2020-05-06)
     How to build and use a dictionary of scientific names

Targeted string replacements with sed and AWK (2020-04-08)
     Avoid the dangers in globally replacing A with B

A curious pair of data ops (2020-03-18)
     Multiple pivots and keying the unreadable

Moving averages with AWK (2020-03-04)
     A command for adding moving averages to a table

Topping and tailing, and the slowness of GNU sort — updated (2019-11-08)
     GNU sort can be a rate-limiting step in a pipeline

How to guess the field separator in a table (2019-10-04)
     Count up the likely field separators in the header line with AWK

Long, narrow tables vs short, wide ones (2019-08-16)
     Three tests of processing speed show that table shape doesn't matter

A bulk replacement GUI with YAD (2019-08-02)
     A shell script for "normalising" pseudo-duplicates in a data table

Finding malformed markup (2019-07-19)
     How to identify messed-up HTML tags in non-HTML documents

Leading and trailing whitespace (2019-06-28)
     How to find and delete "fore and aft" whitespace within fields in a data table

Growing the Cookbook's "broken" function (2019-05-31)
     A more informative way to tally up the number of fields in a data table

How to delete, insert and replace whole lines (2019-05-12)
     Use line addresses to target just the right lines

How to delete, insert and replace whole fields (2019-05-05)
     Cut and paste are usually the right tools for these jobs

Comparing fields across two tables (2019-02-03; updated)
     A script to check for changes in a field

How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons? (2018-12-16)
     On the command line, you can ignore everything but the numbers

Finding changepoints in a list, revisited (2018-12-06)
     Using AWK to find where values change in a list

How to find distances between lat/lons for geochecking (2018-11-07)
     When you're looking for big differences, an approximate method is fine

Bird watching with AWK and grep (2018-10-24)
     Showing off the fastest way to search a text file for strings in another file

How to validate ISO 8601 dates without regex (2018-10-05)
     Check for format and content errors in YYYY-MM-DD fields with AWK

Fightin' fields (2018-09-30)
     Finding disagreements between data fields can be challenging

Fuzzy matching in practice (2018-09-23)
     Tips for approximate matching with tre-agrep

48 sea levels and a trope for your terminal (2018-08-11)
     A bulk string replacement with AWK, and that ACCESS DENIED thing

Pseudo-blank ("empty") records and fields (2018-08-04)
     How to find not-quite-empty rows and columns in a data table

Time series ops (2018-07-23)
     Using AWK to summarise time series data

Partial duplicates (2018-07-14)
     One way to find "pseudoduplicated" records

Truncated data items (2018-07-04)
     Detecting truncations, such as a 100-character string clipped to 50 characters

Compare parts of strings (2018-05-22)
     How to use AWK's "split" function to compare parts of strings


Characters and encoding

Gremlin detection bigly improved and a NUL problem avoided (2021-12-08)
     The gremlin detector script has been rebuilt from scratch

How to watermark a UTF-8 plain text file (2021-11-24)
     Use an inconspicuous Unicode character and a placement code

How to find mixed Latin+Cyrillic words (2021-09-29)
     Blue Latin and red Cyrillic letters in words containing both

Show Unicode code points for UTF-8 characters (2021-09-15)
     Convert a character to its code point (\uxxxx) with shell tools

Yet another gremlin: the zero-width space (2021-09-01)
     How to find and kill it before it does mischief

What is +ACY- doing in the data? (2021-07-14)
     A strange encounter with UTF-7

Mojibake madness (2021-05-19)
     Spectacular examples of gibberish from recent data audits

Mojibake bonanza (2020-12-16)
     New mojibake origin puzzles from a museum database

Encoding detection smackdown (2020-09-23)
     enca vs file vs iconv vs isutf8 vs uchardet

Character equivalence classes 2: the nature of equivalence (2020-06-24)
     What does "something like" actually mean?

Character equivalence classes 1: search and replace (2020-06-17)
     How to find "something like" a character

More mojibake fun (2020-04-01)
     Easy-to-hard examples of translating from gibberish

Hunting gremlins (2020-01-22)
     A script to make invisible gremlin characters visible

Build your own character class inventories — updated (2019-12-27)
     Find out what [:alpha:] and [:cntrl:] mean in your system

Introducing the replo (2019-11-01)
     Character replacements by computers can be reversible, reconstructable or researchable

An unexpected character replacement (2019-10-18)
     Strange replacements of non-ASCII characters by R

Return of the mojibake detective (2019-07-05)
     Three new cases of mysterious character corruptions

Quotes as characters (2019-04-07; updated 2019-05-26)
     How to recognise the nine different kinds of single and double quotes

How to choose special characters, revisited (2019-03-24)
     Scripting a little GUI for copying/pasting your most often-used special characters

iconv and illegal input sequences (2018-09-13)
     Getting around a roadblock in changing the character encoding of a file

SCI and 62;c62;c62;c... (2018-08-25)
     A control character causes strange behaviour in GUI terminals

Mojibake detective work (2018-08-06)
     A close look at some character encoding problems

Question marks that aren't really question marks (2018-07-27)
     Some question marks show that a program doesn't understand a character's encoding

Combo characters (2018-06-09)
     How to deal with Unicode's combining characters


Data formatting

gron the JSON flattener (2022-03-23)
     Flattened JSON can be worked with shell tools

How to flatten ("unpivot") a data table (2022-03-16)
     Make a table into a list of values by row and column

Auto-incrementing version letters (2022-03-02)
     Solutions for building 101c, 101d, 102a, 102b...

A dog-cat-horse-turtle problem (2022-01-19)
     Seven solutions and counting for this one problem

Tidy tables for data processing (2022-01-12)
     Preparing data for programs that don't care about "pretty"

Building an ODT on the command line (2021-12-29)
     Bypass the GUI by starting with HTML

Making a transect into a point and circle (2021-12-22)
     Convert a WKT linestring to point-radius in metres

What's wrong with my footprintWKT? (2021-11-17)
     About WKT and unexplained "invalid" results in GBIF

On visual contrast and QR codes (2021-11-03)
     Boosting contrast makes blurry QR codes readable

TSV to CSV on the CLI (if you really have to) (2021-10-13)
     How to build an RFC4180-compliant CSV from a TSV

zbarimg and blurry QR codes (2021-08-25)
     Surprisingly well-blurred codes are still readable

Two data formatting tweaks (updated) (2021-08-11)
     Handy ways to make tab-separated fields more obvious

Reverse or shuffle a string in a particular field (2021-07-07)
     Shell tools or AWK can do this, or a mix

"Firstname Lastname" to "Lastname, Firstname", with complications (2021-06-23)
     Name parsing and formatting is rarely simple

CSV to table, table to CSV (2021-06-02)
     How to pivot and "de-pivot" a CSV table

Converting a list to a presence/absence table (2021-02-10)
     Re-formatting is easy with tidy, well-structured data

ASCII score bars and a gorblimey command (2021-01-27)
     How to build a string of characters and their complement

Form text and placeholders (2021-01-13)
     Form letters, diaries and mail merge in plain text

Comparing strings more clearly (2020-12-09)
     How to make and emphasise a string comparison between fields

Re-format blah,YYYYMMDD,blah as blah,YYYY,MM,DD,blah (2020-12-02)
     How to do it with sed or AWK: 7 methods

How to stack columns (2020-11-25)
     Turn a "columnated" table into a straight up-and-down one

Building a data table from a sentence (2020-10-07)
     How to expand a condensed data structure

Spotting spaces, and AWK's view of emptiness (2020-09-09)
     A simple way to show and count plain whitespaces,
     and "non-empty" vs "non-empty and non-zero" in AWK

How to number copy/pasted commands (2020-08-05)
     A neat way to number and indent commands and their outputs

Sharing data and metadata together (2020-07-29)
     How not to lose a data table's metadata

A quick repair job on a dislocated table (2020-07-15)
     Fixing a table with displaced fields

Extra commas in a CSV (2020-07-08)
     How to safely delete just the excess commas

Join consecutive lines if condition applies (2020-06-03)
     Simple ways to fix embedded newlines

Printing repeats within repeats, and splitting a list into columns (2020-05-27)
     Why I use pr rather than column for some columnating jobs

How to move selected lines within a file (2020-05-13)
     No need to cut and paste, use the command line

Dealing with an all-CAPS/first-CAP jumble (2020-04-29)
     How to normalise a mix of WORDS and Words

How to be uncertain with dates (2020-02-12)
     A skeptical look at some of ISO 8601's new extensions

JSON Lines: record-style JSON (2020-01-29)
     A bridge between table-style data and standard JSON

Emphasising text in the terminal (2019-12-13)
     Making selected strings stand out with ANSI codes

Embedded newlines without a clue (2019-11-15)
     Without clear markers for field fragments, you need to be creative

Add leading zeroes that aren't really leading (2019-09-13)
     How to format numbers when they're inside non-numeric strings

A GUI to re-order fields in a table (2019-08-30)
     A shell script for building a new table with reordered fields

The lat/lon floating point delusion (2019-08-09)
     That big building is at -33.8903169365705 151.198409720645? Really?

Renumber a list after inserting a line — updated (2019-07-27)
     A handy function for inserting and renumbering

Data from dingbats: copying down (2019-02-24)
     Copying down is easy in a spreadsheet, but it's also possible on the command line

Fancy numbering of records (2019-02-17)
     On the command line, you can number a list of records any way you like

Reformatting a list, cleverly (2019-01-27)
     Create horizontal lists from a vertical one

Horizontal sorting within a field (2019-01-13)
     There are two different ways to sort a field "horizontally", but neither of them is simple.

Changing the month format: a fairly general solution (2018-12-30)
     Build a look-up table and use the starting and finishing format in an AWK command

Putting information into a table from the table's filename (2018-12-13)
     The example adds a date from the filename to each record in the table

Unwrap your fasta (2018-12-01)
     How to concatenate the sequence lines in FASTA files

Repair job: separate the tandem repeats (2018-10-26)
     How to split a tandem repeat between fields

Too many lat/lon digits (2018-06-30)
     Rounding off latitude/longitude data to an appropriate number of significant figures

Embedded newlines (2018-06-23)
     How to safely remove embedded newlines


Data analysis examples

Online shopping and a one2many tweak (2022-02-23)
     How to group product purchases by customer

Are you 10000 days old yet? (2022-01-05)
     Three command-line ways to find out

Batch triangulation on the command line (2021-06-09)
     Locate a point given the distances to two other, located points

Hunting Excel date twins (2021-03-09)
     Microsoft's choice of starting dates leads to duplicate records

The myth of equinoctial gales (2020-10-14)
     Real-world wind data don't show equinoctial gales

What's wrong with these records? (2020-08-26)
     Tinkering with "present in these records, absent in those"

Checking date components across fields (2020-04-15)
     Does "date" agree with "year", "month" and "day"?

Life tables (2020-03-11)
     A sober look at the probability of dying in Australia

Data quality in iNaturalist downloads (2020-02-05)
     Top marks for data from the citizen-science iNaturalist project

Steady as she goes, in Darwin (2019-10-25)
     The daily temperatures in Darwin (Australia) are remarkably constant

Two ugly CSVs (2019-04-28)
     Open but messy data from the Australian Electoral Commission and Companies House

Dog and cat data (2019-03-31)
     A command-line exploration of five public datasets

Data with bulges (2019-03-10)
     Three cases of unexpectedly large values in a data item

Two special data validations (2019-03-03)
     Is that tree correctly located? Is that list of names and addresses truly regular?

Drugs on the command line (2019-01-06)
     A disappointing dive into drugs data from the US Food and Drug Administration

Has the rainfall pattern in my hometown changed? (2018-12-23)
     No obvious trends in number, length or intensity of rainfall events in recent years

Fun with BOM data (2018-07-11)
     Weather watching with wget and gnuplot

Pivoting airlines (2018-06-03)
     Using arrays of arrays to build a pivot table with AWK


AWK tips and tricks

How to use patsplit (GNU AWK) (2022-02-02)
     Another way to split a string with AWK

Combinations from 2 lists: speed trials (2021-12-01)
     Comparing two ways to build Cartesian products

Building a molar mass calculator (2021-03-24)
     A shell script with AWK doing the chemical formula parsing

Updating a file from a lookup table (2020-11-11)
     How to use an AWK array for lookup operations

How to use flags in AWK (revisited) (2020-10-21)
     Flags are handy for defining AWK's working range of records

The easy-going syntax of AWK commands (2020-02-26)
     AWK is flexible and tolerant in its command rules

Another surprising AWK trick (2019-12-06)
     Strings or numbers? It depends on what you're doing with them.

A muggle's guide to AWK arrays: 4 (2019-09-20)
     Easier and more flexible ways to sort array outputs

A muggle's guide to AWK arrays: 3 (2019-08-23)
     Reformatting and table joining using arrays

A muggle's guide to AWK arrays: 2 (2019-07-12)
     Working with two files, or the same file twice

A muggle's guide to AWK arrays: 1 (2019-06-07)
     Array naming, index strings and value strings

A surprising AWK trick (2018-05-27)
     A clever way to avoid using a flag in AWK


BASH tips and tricks

Put an editable command at the next prompt (2021-09-08)
     Two ways to send an unfinished command to a prompt

How to bookmark directories in the shell (2020-06-10)
     A couple of functions is all it takes

Brace expansion with variables and arrays: eval to the rescue (2020-04-22)
     eval, a BASH built-in, solves brace expansion problems

Getting around a subshell problem (2020-01-15)
     Something strange happens with buffering in a subshell

Working around the BASH brace expansion rule (2019-06-14)
     How to build Cartesian string products in BASH

The magic of BASH string expansion (2019-05-19)
     A simple trick that allows AWK and sed to use BASH as an interpreter

Avoiding senior moments with command-line functions (2018-11-13)
     The trick is to make the documentation available on the CLI


Useful programs for command-line data ops

Revisiting a command-line translator (2021-07-28)
     A handy tweak for the translate-shell program

VisiData: a table explorer for the terminal (2019-10-11)
     Display, sort, reformat and more with this CLI utility

Transpose, pivot and bin with GNU Datamash 1.4 (2019-05-24)
     Do complex data transformations more easily with Datamash

Parsing scientific names (gnparser) (2019-01-20)
     Scientific names are much harder to parse than personal names


Data entry and display

Mapping with gnuplot, part 3 (2022-04-06)
     How to create and animate "layers" on a gnuplot basemap

Mapping with gnuplot, part 2 (2022-03-30)
     How to build a good-quality, fixed-scale basemap with gnuplot

An AWK histogram with scaling (2021-09-22)
     In these histograms, bar length is scaled to the longest bar length

CSV viewers for CSV haters (2021-08-18)
     Two CLI tools and one GUI

Visualising data as a PGM image (2021-08-04)
     A not-very-successful experiment

A sunset surprise (2021-02-17)
     Data graphics help to explain a puzzling phenomenon

Changing TTY prompt, font and colors (2020-02-19)
     How to prettify your virtual terminals

Data validation on entry with YAD (2019-11-29)
     In praise of lookup lists for data entry, with help from YAD dialogs

Plotting data in the terminal with gnuplot (2019-06-21)
     A separate graphic is much better than an in-terminal plot

Making pictures with data (2019-04-14)
     How to display data bytes as image bytes

Mapping with gnuplot (2018-10-31)
     How to use gnuplot to put data points on a basemap

How to enter nothing in a database (2018-10-18)
     If you have nothing to say, say nothing

Displaying data from table fragments (2018-09-06)
     One way to build a tidy table from a jumble of data

A record pager built with YAD (2018-08-18; updated 2018-09-09)
     How to turn a YAD dialog into a GUI viewer/pager for records in a data table

GUI ways to view and edit big text files (2018-07-31)
     glogg, gvim, Geany and csvpad, but not spreadsheets

YAD repeat and edit (2018-05-21)
     How to avoid re-entering data in a YAD data entry form


The Windows and spreadsheet worlds

Apple + Microsoft = character confusion (2022-02-09)
     Saving a .docx to plain text can fail in odd ways

Spreadsheet annoyance no. 3: quotes have priority (2021-01-20)
     Beware of unmatched quotes in data items

A grizzle about captive data (2020-07-22)
     Don't confuse data with the Windows software that contains it

Spreadsheet annoyance no. 2 (2019-04-21)
     Spreadsheets make dates out of entries that aren't dates, but that's not all they mess up

The trouble with Windows CRLF (2019-03-17)
     Windows line endings are a pain in the ... terminal

Getting data out of Excel safely (2019-02-10)
     Watch out for embedded linebreaks, comma problems and character encoding issues

Curse of the CSV monster (2018-07-18)
     CSV with broken records to TSV


Miscellaneous stuff

DNA-style frameshift cryptography (2022-02-16)
     Secret messaging inspired by biology

Scripting a temperature notifier (2022-01-26)
     How cold did it get last night, and how cold is it now?

The data worker's guide to psiphiorrhea (2021-07-21)
     Too many decimal places? There's a name for that

The little museum and its data (2021-06-01)
     Love affairs between science and IT don't always end well

A short rant about Python, R and UNIX (2020-10-28)
     Why would you clean data with Python or R?

A data table thousands of years old (2020-08-12)
     Modern record-keeping in ancient Mesopotamia

Second Tuesday of each month and a BASHing data century (2020-03-25)
     ncal and the 100th blog post

Msot popele can undreatnsd tihs setennce (2019-12-20)
     Garbling and ungarbling with shell scripts

Python and shell tools (2019-11-22)
     A comparison of three data operations

A command-line "Countdown" (UK) companion (2019-09-27)
     Fast solver for anagram puzzles, and a puzzle generator

Getting data from an Enphase Envoy S (2019-09-06)
     Two user-accessible JSON files with performance data

Data on clay (2018-09-20)
     Cheap data storage for thousands of years? Check.
     Ancient glyphs in your terminal? Check.