
About this blog
BASHing data was a companion blog to A Data Cleaner's Cookbook. The blog was a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing etc.
All 200 BASHing data posts (2018-2022) and version 3 of A Data Cleaner's Cookbook have been archived in Zenodo and can be downloaded for offline use. Links between the blog and the Cookbook are all local in the archived versions, so you can use both resources without needing to go online.
About me
I'm a data auditor and retired zoologist.
Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Topic categories:
- Data auditing, cleaning and processing
- Characters and encoding
- Data formatting
- Data analysis examples
- AWK tips and tricks
- BASH tips and tricks
- Useful programs for command-line data ops
- Data entry and display
- The Windows and spreadsheet worlds
- Miscellaneous stuff
Posts by category (most recent post first):
Data auditing, cleaning and processing
People are the best data cleaners (2022-04-08)
Between spreadsheets and Big Data analytics is the command line
Search for (exact) strings; report line, column and context (2022-03-09)
A coloured grep for data tables
Detecting truncations: another sometimes successful method (2021-12-15)
This is a difficult job and every command-line trick helps
A quick cross-file comparison with AWK (2021-11-10)
AWK neatly does a tricky data comparison
Duplicate records differing only in unique identifiers - updated (2021-10-27)
A much-improved method for finding these partial duplicates
Some regex tests with grep, sed and AWK (2021-10-20)
Speed tests for different search/filter cases
How to do replacements based on multiple field values (2021-10-06)
Command-line repairs with a powerfully simple tool
There's data missing - please explain (2021-06-30)
A blank entry can have hidden meanings
The curious world of check digits (2021-06-16)
How they work, and code to validate an ABN
The Incrementing Fill-Down Error (2021-05-26)
Another data crime with spreadsheeting as the prime suspect
A data checker's checklist (2021-05-12)
A draft outline of topics for the next online resource
How to fix "one2many" data issues (2021-03-17)
Command-line repairs for a surprisingly common type of error
DIY primary/foreign key relationships, again — updated (2021-03-25)
A script to check for primary/foreign key issues
Four kinds of data anomalies (2021-02-24)
Anomalies might be out of range, out of place, out of match or out of date
How to find the missing parts of a series (2021-02-03)
Command-line solutions for a simple and two more complicated cases
How to build a multi-file fields concordance (2020-12-23)
Clearly show which fields have the same name in two or more files
Check the day of year, given a date (2020-11-18)
Comparing ISO 8601 dates with their day numbers
How to keep an eye on field numbers (2020-11-04)
Put the field numbers on a digital Post-it note with YAD
Three kangaroos in the ocean (2020-09-30)
Ridiculous outliers can sometimes be worth salvaging
Finding one-to-many entries in a data table (updated) (2020-09-16)
Too many B's for each A?
Checking DIY primary/foreign key relationships (2020-09-02)
Problems when primary and foreign keys are hand-built
How to do a both/neither/one/other tally - updated (2020-09-06)
A simple check on paired fields (like latitude and longitude) in a data table
How to find almost-duplicates (2020-07-01)
Two methods that work with some (but not all) data tables
Add an issues field to a data table (2020-05-20)
How to get records to self-report their problems
Spellchecking scientific names on the command line (2020-05-06)
How to build and use a dictionary of scientific names
Targeted string replacements with sed and AWK (2020-04-08)
Avoid the dangers in globally replacing A with B
A curious pair of data ops (2020-03-18)
Multiple pivots and keying the unreadable
Moving averages with AWK (2020-03-04)
A command for adding moving averages to a table
Topping and tailing, and the slowness of GNU sort — updated (2019-11-08)
GNU sort can be a rate-limiting step in a pipeline
How to guess the field separator in a table (2019-10-04)
Count up the likely field separators in the header line with AWK
Long, narrow tables vs short, wide ones (2019-08-16)
Three tests of processing speed show that table shape doesn't matter
A bulk replacement GUI with YAD (2019-08-02)
A shell script for "normalising" pseudo-duplicates in a data table
Finding malformed markup (2019-07-19)
How to identify messed-up HTML tags in non-HTML documents
Leading and trailing whitespace (2019-06-28)
How to find and delete "fore and aft" whitespace within fields in a data table
Growing the Cookbook's "broken" function (2019-05-31)
A more informative way to tally up the number of fields in a data table
How to delete, insert and replace whole lines (2019-05-12)
Use line addresses to target just the right lines
How to delete, insert and replace whole fields (2019-05-05)
Cut and paste are usually the right tools for these jobs
Comparing fields across two tables (2019-02-03; updated)
A script to check for changes in a field
How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons? (2018-12-16)
On the command line, you can ignore everything but the numbers
Finding changepoints in a list, revisited (2018-12-06)
Using AWK to find where values change in a list
How to find distances between lat/lons for geochecking (2018-11-07)
When you're looking for big differences, an approximate method is fine
Bird watching with AWK and grep (2018-10-24)
Showing off the fastest way to search a text file for strings in another file
How to validate ISO 8601 dates without regex (2018-10-05)
Check for format and content errors in YYYY-MM-DD fields with AWK
Fightin' fields (2018-09-30)
Finding disagreements between data fields can be challenging
Fuzzy matching in practice (2018-09-23)
Tips for approximate matching with tre-agrep
48 sea levels and a trope for your terminal (2018-08-11)
A bulk string replacement with AWK, and that ACCESS DENIED thing
Pseudo-blank ("empty") records and fields (2018-08-04)
How to find not-quite-empty rows and columns in a data table
Time series ops (2018-07-23)
Using AWK to summarise time series data
Partial duplicates (2018-07-14)
One way to find "pseudoduplicated" records
Truncated data items (2018-07-04)
Detecting truncations, such as a 100-character string clipped to 50 characters
Compare parts of strings (2018-05-22)
How to use AWK's "split" function to compare parts of strings
Characters and encoding
Gremlin detection bigly improved and a NUL problem avoided (2021-12-08)
The gremlin detector script has been rebuilt from scratch
How to watermark a UTF-8 plain text file (2021-11-24)
Use an inconspicuous Unicode character and a placement code
How to find mixed Latin+Cyrillic words (2021-09-29)
Blue Latin and red Cyrillic letters in words containing both
Show Unicode code points for UTF-8 characters (2021-09-15)
Convert a character to its code point (\uxxxx) with shell tools
Yet another gremlin: the zero-width space (2021-09-01)
How to find and kill it before it does mischief
What is +ACY- doing in the data? (2021-07-14)
A strange encounter with UTF-7
Mojibake madness (2021-05-19)
Spectacular examples of gibberish from recent data audits
Mojibake bonanza (2020-12-16)
New mojibake origin puzzles from a museum database
Encoding detection smackdown (2020-09-23)
enca vs file vs iconv vs isutf8 vs uchardet
Character equivalence classes 2: the nature of equivalence (2020-06-24)
What does "something like" actually mean?
Character equivalence classes 1: search and replace (2020-06-17)
How to find "something like" a character
More mojibake fun (2020-04-01)
Easy-to-hard examples of translating from gibberish
Hunting gremlins (2020-01-22)
A script to make invisible gremlin characters visible
Build your own character class inventories — updated (2019-12-27)
Find out what [:alpha:] and [:cntrl:] mean in your system
Introducing the replo (2019-11-01)
Character replacements by computers can be reversible, reconstructable or researchable
An unexpected character replacement (2019-10-18)
Strange replacements of non-ASCII characters by R
Return of the mojibake detective (2019-07-05)
Three new cases of mysterious character corruptions
Quotes as characters (2019-04-07; updated 2019-05-26)
How to recognise the nine different kinds of single and double quotes
How to choose special characters, revisited (2019-03-24)
Scripting a little GUI for copying/pasting your most often-used special characters
iconv and illegal input sequences (2018-09-13)
Getting around a roadblock in changing the character encoding of a file
SCI and 62;c62;c62;c... (2018-08-25)
A control character causes strange behaviour in GUI terminals
Mojibake detective work (2018-08-06)
A close look at some character encoding problems
Question marks that aren't really question marks (2018-07-27)
Some question marks show that a program doesn't understand a character's encoding
Combo characters (2018-06-09)
How to deal with Unicode's combining characters
Data formatting
gron the JSON flattener (2022-03-23)
Flattened JSON can be worked with shell tools
How to flatten ("unpivot") a data table (2022-03-16)
Make a table into a list of values by row and column
Auto-incrementing version letters (2022-03-02)
Solutions for building 101c, 101d, 102a, 102b...
A dog-cat-horse-turtle problem (2022-01-19)
Seven solutions and counting for this one problem
Tidy tables for data processing (2022-01-12)
Preparing data for programs that don't care about "pretty"
Building an ODT on the command line (2021-12-29)
Bypass the GUI by starting with HTML
Making a transect into a point and circle (2021-12-22)
Convert a WKT linestring to point-radius in metres
What's wrong with my footprintWKT? (2021-11-17)
About WKT and unexplained "invalid" results in GBIF
On visual contrast and QR codes (2021-11-03)
Boosting contrast makes blurry QR codes readable
TSV to CSV on the CLI (if you really have to) (2021-10-13)
How to build an RFC4180-compliant CSV from a TSV
zbarimg and blurry QR codes (2021-08-25)
Surprisingly well-blurred codes are still readable
Two data formatting tweaks (updated) (2021-08-11)
Handy ways to make tab-separated fields more obvious
Reverse or shuffle a string in a particular field (2021-07-07)
Shell tools or AWK can do this, or a mix
"Firstname Lastname" to "Lastname, Firstname", with complications (2021-06-23)
Name parsing and formatting is rarely simple
CSV to table, table to CSV (2021-06-02)
How to pivot and "de-pivot" a CSV table
Converting a list to a presence/absence table (2021-02-10)
Re-formatting is easy with tidy, well-structured data
ASCII score bars and a gorblimey command (2021-01-27)
How to build a string of characters and their complement
Form text and placeholders (2021-01-13)
Form letters, diaries and mail merge in plain text
Comparing strings more clearly (2020-12-09)
How to make and emphasise a string comparison between fields
Re-format blah,YYYYMMDD,blah as blah,YYYY,MM,DD,blah (2020-12-02)
How to do it with sed or AWK: 7 methods
How to stack columns (2020-11-25)
Turn a "columnated" table into a straight up-and-down one
Building a data table from a sentence (2020-10-07)
How to expand a condensed data structure
Spotting spaces, and AWK's view of emptiness (2020-09-09)
A simple way to show and count plain whitespaces, and "non-empty" vs "non-empty and non-zero" in AWK
How to number copy/pasted commands (2020-08-05)
A neat way to number and indent commands and their outputs
Sharing data and metadata together (2020-07-29)
How not to lose a data table's metadata
A quick repair job on a dislocated table (2020-07-15)
Fixing a table with displaced fields
Extra commas in a CSV (2020-07-08)
How to safely delete just the excess commas
Join consecutive lines if condition applies (2020-06-03)
Simple ways to fix embedded newlines
Printing repeats within repeats, and splitting a list into columns (2020-05-27)
Why I use pr rather than column for some columnating jobs
How to move selected lines within a file (2020-05-13)
No need to cut and paste, use the command line
Dealing with an all-CAPS/first-CAP jumble (2020-04-29)
How to normalise a mix of WORDS and Words
How to be uncertain with dates (2020-02-12)
A skeptical look at some of ISO 8601's new extensions
JSON Lines: record-style JSON (2020-01-29)
A bridge between table-style data and standard JSON
Emphasising text in the terminal (2019-12-13)
Making selected strings stand out with ANSI codes
Embedded newlines without a clue (2019-11-15)
Without clear markers for field fragments, you need to be creative
Add leading zeroes that aren't really leading (2019-09-13)
How to format numbers when they're inside non-numeric strings
A GUI to re-order fields in a table (2019-08-30)
A shell script for building a new table with reordered fields
The lat/lon floating point delusion (2019-08-09)
That big building is at -33.8903169365705 151.198409720645? Really?
Renumber a list after inserting a line — updated (2019-07-27)
A handy function for inserting and renumbering
Data from dingbats: copying down (2019-02-24)
Copying down is easy in a spreadsheet, but it's also possible on the command line
Fancy numbering of records (2019-02-17)
On the command line, you can number a list of records any way you like
Reformatting a list, cleverly (2019-01-27)
Create horizontal lists from a vertical one
Horizontal sorting within a field (2019-01-13)
There are two different ways to sort a field "horizontally", but neither of them is simple
Changing the month format: a fairly general solution (2018-12-30)
Build a look-up table and use the starting and finishing format in an AWK command
Putting information into a table from the table's filename (2018-12-13)
The example adds a date from the filename to each record in the table
Unwrap your fasta (2018-12-01)
How to concatenate the sequence lines in FASTA files
Repair job: separate the tandem repeats (2018-10-26)
How to split a tandem repeat between fields
Too many lat/lon digits (2018-06-30)
Rounding off latitude/longitude data to an appropriate number of significant figures
Embedded newlines (2018-06-23)
How to safely remove embedded newlines
Data analysis examples
Online shopping and a one2many tweak (2022-02-23)
How to group product purchases by customer
Are you 10000 days old yet? (2022-01-05)
Three command-line ways to find out
Batch triangulation on the command line (2021-06-09)
Locate a point given the distances to two other, located points
Hunting Excel date twins (2021-03-09)
Microsoft's choice of starting dates leads to duplicate records
The myth of equinoctial gales (2020-10-14)
Real-world wind data don't show equinoctial gales
What's wrong with these records? (2020-08-26)
Tinkering with "present in these records, absent in those"
Checking date components across fields (2020-04-15)
Does "date" agree with "year", "month" and "day"?
Life tables (2020-03-11)
A sober look at the probability of dying in Australia
Data quality in iNaturalist downloads (2020-02-05)
Top marks for data from the citizen-science iNaturalist project
Steady as she goes, in Darwin (2019-10-25)
The daily temperatures in Darwin (Australia) are remarkably constant
Two ugly CSVs (2019-04-28)
Open but messy data from the Australian Electoral Commission and Companies House
Dog and cat data (2019-03-31)
A command-line exploration of five public datasets
Data with bulges (2019-03-10)
Three cases of unexpectedly large values in a data item
Two special data validations (2019-03-03)
Is that tree correctly located? Is that list of names and addresses truly regular?
Drugs on the command line (2019-01-06)
A disappointing dive into drugs data from the US Food and Drug Administration
Has the rainfall pattern in my hometown changed? (2018-12-23)
No obvious trends in number, length or intensity of rainfall events in recent years
Fun with BOM data (2018-07-11)
Weather watching with wget and gnuplot
Pivoting airlines (2018-06-03)
Using arrays of arrays to build a pivot table with AWK
AWK tips and tricks
How to use patsplit (GNU AWK) (2022-02-02)
Another way to split a string with AWK
Combinations from 2 lists: speed trials (2021-12-01)
Comparing two ways to build Cartesian products
Building a molar mass calculator (2021-03-24)
A shell script with AWK doing the chemical formula parsing
Updating a file from a lookup table (2020-11-11)
How to use an AWK array for lookup operations
How to use flags in AWK (revisited) (2020-10-21)
Flags are handy for defining AWK's working range of records
The easy-going syntax of AWK commands (2020-02-26)
AWK is flexible and tolerant in its command rules
Another surprising AWK trick (2019-12-06)
Strings or numbers? It depends on what you're doing with them.
A muggle's guide to AWK arrays: 4 (2019-09-20)
Easier and more flexible ways to sort array outputs
A muggle's guide to AWK arrays: 3 (2019-08-23)
Reformatting and table joining using arrays
A muggle's guide to AWK arrays: 2 (2019-07-12)
Working with two files, or the same file twice
A muggle's guide to AWK arrays: 1 (2019-06-07)
Array naming, index strings and value strings
A surprising AWK trick (2018-05-27)
A clever way to avoid using a flag in AWK
BASH tips and tricks
Put an editable command at the next prompt (2021-09-08)
Two ways to send an unfinished command to a prompt
How to bookmark directories in the shell (2020-06-10)
A couple of functions is all it takes
Brace expansion with variables and arrays: eval to the rescue (2020-04-22)
eval, a BASH built-in, solves brace expansion problems
Getting around a subshell problem (2020-01-15)
Something strange happens with buffering in a subshell
Working around the BASH brace expansion rule (2019-06-14)
How to build Cartesian string products in BASH
The magic of BASH string expansion (2019-05-19)
A simple trick that allows AWK and sed to use BASH as an interpreter
Avoiding senior moments with command-line functions (2018-11-13)
The trick is to make the documentation available on the CLI
Useful programs for command-line data ops
Revisiting a command-line translator (2021-07-28)
A handy tweak for the translate-shell program
VisiData: a table explorer for the terminal (2019-10-11)
Display, sort, reformat and more with this CLI utility
Transpose, pivot and bin with GNU Datamash 1.4 (2019-05-24)
Do complex data transformations more easily with Datamash
Parsing scientific names (gnparser) (2019-01-20)
Scientific names are much harder to parse than personal names
Data entry and display
Mapping with gnuplot, part 3 (2022-04-06)
How to create and animate "layers" on a gnuplot basemap
Mapping with gnuplot, part 2 (2022-03-30)
How to build a good-quality, fixed-scale basemap with gnuplot
An AWK histogram with scaling (2021-09-22)
In these histograms, bar length is scaled to the longest bar length
CSV viewers for CSV haters (2021-08-18)
Two CLI tools and one GUI
Visualising data as a PGM image (2021-08-04)
A not-very-successful experiment
A sunset surprise (2021-02-17)
Data graphics help to explain a puzzling phenomenon
Changing TTY prompt, font and colors (2020-02-19)
How to prettify your virtual terminals
Data validation on entry with YAD (2019-11-29)
In praise of lookup lists for data entry, with help from YAD dialogs
Plotting data in the terminal with gnuplot (2019-06-21)
A separate graphic is much better than an in-terminal plot
Making pictures with data (2019-04-14)
How to display data bytes as image bytes
Mapping with gnuplot (2018-10-31)
How to use gnuplot to put data points on a basemap
How to enter nothing in a database (2018-10-18)
If you have nothing to say, say nothing
Displaying data from table fragments (2018-09-06)
One way to build a tidy table from a jumble of data
A record pager built with YAD (2018-08-18; updated 2018-09-09)
How to turn a YAD dialog into a GUI viewer/pager for records in a data table
GUI ways to view and edit big text files (2018-07-31)
glogg, gvim, Geany and csvpad, but not spreadsheets
YAD repeat and edit (2018-05-21)
How to avoid re-entering data in a YAD data entry form
The Windows and spreadsheet worlds
Apple + Microsoft = character confusion (2022-02-09)
Saving a .docx to plain text can fail in odd ways
Spreadsheet annoyance no. 3: quotes have priority (2021-01-20)
Beware of unmatched quotes in data items
A grizzle about captive data (2020-07-22)
Don't confuse data with the Windows software that contains it
Spreadsheet annoyance no. 2 (2019-04-21)
Spreadsheets make dates out of entries that aren't dates, but that's not all they mess up
The trouble with Windows CRLF (2019-03-17)
Windows line endings are a pain in the ... terminal
Getting data out of Excel safely (2019-02-10)
Watch out for embedded linebreaks, comma problems and character encoding issues
Curse of the CSV monster (2018-07-18)
CSV with broken records to TSV
Miscellaneous stuff
DNA-style frameshift cryptography (2022-02-16)
Secret messaging inspired by biology
Scripting a temperature notifier (2022-01-26)
How cold did it get last night, and how cold is it now?
The data worker's guide to psiphiorrhea (2021-07-21)
Too many decimal places? There's a name for that
The little museum and its data (2021-06-01)
Love affairs between science and IT don't always end well
A short rant about Python, R and UNIX (2020-10-28)
Why would you clean data with Python or R?
A data table thousands of years old (2020-08-12)
Modern record-keeping in ancient Mesopotamia
Second Tuesday of each month and a BASHing data century (2020-03-25)
ncal and the 100th blog post
Msot popele can undreatnsd tihs setennce (2019-12-20)
Garbling and ungarbling with shell scripts
Python and shell tools (2019-11-22)
A comparison of three data operations
A command-line "Countdown" (UK) companion (2019-09-27)
Fast solver for anagram puzzles, and a puzzle generator
Getting data from an Enphase Envoy S (2019-09-06)
Two user-accessible JSON files with performance data
Data on clay (2018-09-20)
Cheap data storage for thousands of years? Check.
Ancient glyphs in your terminal? Check.