About this blog

This is the second series (2024 >) of the BASHing data blog. The first series of 200 posts (2018-2022) and this one are companion websites to A Data Cleaner's Cookbook. Like the first series, the current blog is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing etc.

The first BASHing data series and A Data Cleaner's Cookbook are still online, but they are also archived in Zenodo and can be downloaded for offline use. The first 75 posts in this second BASHing data series are likewise archived in Zenodo.

This website has a feed.

About me

I'm a data auditor and retired zoologist.

Robert Mesibov, West Ulverstone, Tasmania, Australia
mesibov@datafix.com.au

The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Topic categories:

Data auditing, cleaning and processing
Characters and encoding
Data formatting
Data analysis examples
Data entry and display
Useful programs for command-line data ops
AWK tips and tricks
BASH tips and tricks
Miscellaneous stuff

Posts by category (most recent post first):

Data auditing, cleaning and processing

Serial numbering based on changing values in another field (2025-12-12)
Add a field with numbers that aren't simply serial

Similar to "most distant pair of points", sort of (2025-11-28)
Doing a spatial data check without GIS

A data table full of ghosts (2025-10-31)
The "blanks" are not what they seem

Four ways to prepend a line of text (2025-10-10)
Try it with sed or cats

Filling down blanks in multiple fields (2025-08-29)
A spreadsheet can't do this in one pass, but AWK can

Converting lat/lon from DMS to DD without screaming (2025-08-22)
How to ignore everything but numbers and letters

Data noise (2025-08-08)
Annoying problems in apparently "tidy" text files

GNU sed's handy -z option (2025-07-25)
Numbered-occurrence replacements on whole files, not just lines

Does string A contain string B? Ask AWK's index (2025-02-28)
Demonstrating a handy use for the index function

Extract the year from a date string without using the date command (2025-02-21)
Demonstrating 2 methods and 4 variations for each

Adding the missing keys and values in a key-value series (2025-01-24)
Easily done with an AWK array

Replace the last N occurrences of a pattern in a string (2025-01-03)
Drive sed with a for loop

Numbering duplicates by appearance order and date order (2024-12-20)
How to violate the "one-field-one-kind-of-information" principle of databasing

Another embedded newlines fix (2024-11-29)
If all records begin with the same string, there's an interesting AWK solution

Merging tables with (some) shared fields (2024-11-15)
Get the fields into alignment with datamash and the join command

Timing a CSV to TSV operation (2024-11-01)
How to quickly and easily compare process times?

Documenting edits with a before-and-after report (2024-09-13)
A tweak to make the output more informative

Find the first, last, nth and first+last occurrence of a string (2024-06-21)
Showing the easiest ways I know to do these jobs

Extract successive pairs from a list, and rapidly grow a list (2024-05-03)
How to do it, but be careful with the "yes" command

Post- and pre-incrementing (var++ and ++var) with AWK (2024-04-26)
Pre or post? Sometimes it doesn't matter

Finding near-duplicate spelling variants (2024-04-05)
How to search for ä/ae-type duplicates

Table in a PDF to a TSV, on the command line (2024-03-29)
Use the pdftotext utility and clean up with sed and AWK

Finding identifier codes with and without extra characters (2024-02-02)
A command-line solution for finding near-duplicate values

Characters and encoding

THE escape character, not AN escape character (2025-11-21)
Different ways to enter U+001B

Mojibake detective: the case of the Greek claw (2025-10-17)
Microsoft is again the chief suspect

My shell and my browser don't understand each other (2025-09-26)
A copy-paste script to convert characters to their HTML versions

Beware these characters in a terminal (2025-06-20)
Really annoying behaviour for CLI users

How to hide a number in plain sight (2025-06-13)
A simple cryptographic trick

The ìèñëèâñüêå mystery (2025-01-31)
The killer was... the Microsoft Corporation

A Unicode normalisation problem (2025-01-10)
How to get rid of full-width characters

The Web's most familiar gibberish: â€™ (2024-11-22)
Unfortunately, it isn't going to go away anytime soon

Mojibake, anyone? (2024-07-19)
More delightful examples from real-world data audits

How to detect and convert those baffling ruffians (2024-06-28)
Beware of Latin ligatures

A text full of nulls - what happened? (2024-06-07)
Hint: Microsoft Windows encoding

Print a character as a variable with BASH printf (2024-03-22)
There's a right way and a wrong way, but both work

Counterfeit spaces: the NBSP menace (2024-03-01)
How to visualise and replace (or delete) NBSPs

Mojibake with 2 hearts and 52 bytes (2024-02-09)
Encoding ping-pong between UTF-8 and Windows-1252

Data formatting

Format musings 3; the last "BASHing data" post (2025-12-26) ⇜ LATEST
What's a "LML"?

Format musings 2: CSV vs KVR (key-values in rows) (2025-11-14)
Another way to store tabular data

Not on the keyboard. How to type it? (2025-11-07)
You can type special characters directly (almost)

The difference between two dates: easy solutions and hard (2025-10-24)
Durn, that's a handy function

Another tricky formatting problem (2025-09-19)
How to merge lines, differently

Format musings 1: NestedText and indentation (2025-07-18)
A data format that relies on indentation could be risky

Multiple-line records to a simple table (2025-05-16)
A single AWK command regularises an irregular set of records

How to add trailing spaces and zeroes (2025-04-25)
Spaces easy, zeroes tricky

Extreme reformatting: a vertical calendar (2025-04-18; updated 2026-03-30)
It took a surprising amount of work to build vertically

Text processing with xargs and jot (2025-03-28)
Demonstrating niche uses for these two utilities

Munging the Atlas of Living Australia table format (2024-12-06)
Why is the header in a separate file?

USV: The Unicode Separated Values format (2024-10-11)
It's new and interesting

Line spacing tricks - updated (2024-07-12/2024-09-03)
sed, AWK and grep are your friends

Archiving images: TIFF vs PPM (2024-07-04)
Which format will be more easily readable in 1000 years?

DataMatrix codes and data content (2024-04-19)
Squeezing lots of information into a tiny graphic

CSV to JSON to CSV, awkwardly (2024-04-12)
Recovering CSV data from an awful JSON file

Convert Microsoft serial day numbers to YYYY-MM-DD (2024-02-23)
Easy, if you remember that 1900-02-29 didn't happen

Data analysis examples

Where there's a shell, there's a (usually simpler) way (2025-12-05)
There are n different ways to code a solution, and n can be large

5, 7, 8, 9, 10, 12, 14, 15, 17. Any advance on 17? (2025-09-05)
Agreement on missing data, but what to do about it?

How to ignore everything but numbers (2025-05-23)
If AWK sees a number first, it thinks arithmetically

Find all data points "X" km or less from a given point (2025-03-14)
A command-line alternative to working with a GIS program

Permutations and combinations of pairs with AWK (2025-03-07)
Easy ways to get results with and without repetition

Anatomy of a data analysis (2024-10-25)
5 million records dissected with BASH arrays

Summing by type in a table (2024-09-20)
What to do if the table layout is awful

Minimum, maximum and range by group (2024-05-24)
GNU datamash is great, but sometimes more is needed

Data entry and display

Data entry with unknown data categories (2025-07-04)
What to do when you don't know the fields in advance

Rename time-series files for chronological sorting (2025-04-04)
Dealing with the "document15176075861143268989.pdf" problem

Four exercises with data art (2025-02-14)
Colorful fun with PPMs

Pretty-printing a table in the terminal - updated (2024-11-08; 2025-02-21)
Three little-known CLI programs and a tip about less

A plotting-in-terminal solution: sixels and mlterm (2024-10-04)
With some terminals, sixel graphics are wonderfully easy to use

Millipedes and maps (2024-08-16)
A script to automate some map-making for the Web

Searching a pick-list with YAD (2024-08-09)
YAD can display form options from a list

Middle-click paste a series of numbers or letters (2024-07-26)
A neat trick that might be handy someday

Mapping with gnuplot, part 5 (2024-03-15)
Building a dialog for choosing data to be mapped

Mapping with gnuplot, part 4 (2024-03-08)
Showing a much-improved way to build a basemap

Useful programs for command-line data ops

Too many keyboard shortcuts to remember easily? (2025-10-03)
One keyboard shortcut to rule them all

Square root days, prime years and maximum-factor years (2025-08-15)
How to make the "factor" utility work on time

zet for sets (2025-08-01)
Unions, intersects etc without changing line order

Making an archive job a lot easier (2025-06-06)
Selectively unzip and rename, all on the command line

Two more tweaks for the ranger file manager (2025-05-09)
Wrap text in preview, and improve the default colors

csvlens: a delimited text file viewer for the terminal (2025-05-02)
TL;DR: it works very well as a viewer!

New code for my translation box (2025-03-21)
Translations on the fly, item by item

MAD about the median (2025-01-17)
That's Median Absolute Deviation, a useful statistic

7 ways to get the source code of a webpage (2024-09-27)
Not the same as Web scraping

Escaping from Microsoft Excel on the command line (2024-08-30)
xlsx2csv, in2csv, ssconvert and unoconv

How to crunch a grawlix (2024-08-02)
Demonstrating an unusual use for crunching

Five useful tweaks for the ranger file manager (2024-06-14)
Easy ways to make this CLI utility even better

Polyglot and round-trip translations (2024-05-31)
Flexible translations with translate-shell

GNU datamash and months (2024-02-16)
How to help datamash over the month-sorting hurdle

AWK tips and tricks

Five ways to pass a shell variable to AWK (2025-06-27)
Two simple ways and three clever tricks

AWK's view of existence (2025-02-07)
Empty vs zero: beware the difference

How to force a preferred array sort in AWK (2024-10-18)
Use a second array to control the first

Find a word, plus words either side of the matching word (2024-08-23)
It might be easier with AWK than with grep

AWK one-liners to multi-liners (2024-05-10)
A little-known "pretty print" option

BASH tips and tricks

Copy selected items from a terminal to a text file (2025-12-19)
Scripting the job makes it fast and easy

The script command for tinkerers (2025-07-11)
A handy way to record command history

A launcher for occasionally used applications (2025-04-11)
A DIY desktop tool for speed and efficiency

Sorting camels, kebabs, pascals and snakes (2024-12-13)
Word case has a powerful effect on sorting

Miscellaneous stuff

Change of habit: Geany out, Mousepad in (2025-09-12)
Moving to a surprisingly capable text editor

What a long, strange trip it's been (2025-05-30)
Computer users were computer programmers, to begin with

The browser-as-text-editor trick (2024-12-27)
It's a simple trick and can save you having to open a separate text editor

Tools of my trades (2024-09-06)
23 GUI and 72 CLI programs I need for my work

The curious world of UUIDs (2024-05-17)
What they are and how to tinker with them