banner

About this blog

BASHing data is a companion blog to A Data Cleaner's Cookbook. This is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing etc.

The first 100 BASHing data posts (to 2020-03-25) and version 2 of A Data Cleaner's Cookbook have been archived in Zenodo and can be downloaded for offline use. Links between the blog and the Cookbook are all local in the archived versions, so you can use both resources without needing to go online.

Want to comment?

Email me. If your remarks are on-topic and helpful, they'll be edited straight into the relevant post, not buried in a list of comments at the bottom of the webpage.

Want notice of future posts?

Copy the RSS link into your feed reader:    RSS

About me

I'm a data auditor and retired zoologist.

Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com

The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Topic categories:


Most recent post:

Finding one-to-many entries in a data table (2020-09-16)
     Too many B's for each A?


Posts by category (most recent post first):

Data auditing, cleaning and processing

Finding one-to-many entries in a data table (2020-09-16)
     Too many B's for each A?

Checking DIY primary/foreign key relationships (2020-09-02)
     Problems when primary and foreign keys are hand-built

How to do a both/neither/one/other tally - updated (2020-09-06)
     A simple check on paired fields (like latitude and longitude) in a data table

How to find almost-duplicates (2020-07-01)
     Two methods that work with some (but not all) data tables

Add an issues field to a data table (2020-05-20)
     How to get records to self-report their problems

Spellchecking scientific names on the command line (2020-05-06)
     How to build and use a dictionary of scientific names

Targeted string replacements with sed and AWK (2020-04-08)
     Avoid the dangers in globally replacing A with B

A curious pair of data ops (2020-03-18)
     Multiple pivots and keying the unreadable

Moving averages with AWK (2020-03-04)
     A command for adding moving averages to a table

Topping and tailing, and the slowness of GNU sort — updated (2019-11-08)
     GNU sort can be a rate-limiting step in a pipeline

How to guess the field separator in a table (2019-10-04)
     Count up the likely field separators in the header line with AWK

Long, narrow tables vs short, wide ones (2019-08-16)
     Three tests of processing speed show that table shape doesn't matter

A bulk replacement GUI with YAD (2019-08-02)
     A shell script for "normalising" pseudo-duplicates in a data table

Finding malformed markup (2019-07-19)
     How to identify messed-up HTML tags in non-HTML documents

Leading and trailing whitespace (2019-06-28)
     How to find and delete "fore and aft" whitespace within fields in a data table

Growing the Cookbook's "broken" function (2019-05-31)
     A more informative way to tally up the number of fields in a data table

How to delete, insert and replace whole lines (2019-05-12)
     Use line addresses to target just the right lines

How to delete, insert and replace whole fields (2019-05-05)
     Cut and paste are usually the right tools for these jobs

Comparing fields across two tables (2019-02-03; updated)
     A script to check for changes in a field

How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons? (2018-12-16)
     On the command line, you can ignore everything but the numbers

Finding changepoints in a list, revisited (2018-12-06)
     Using AWK to find where values change in a list

How to find distances between lat/lons for geochecking (2018-11-07)
     When you're looking for big differences, an approximate method is fine

Bird watching with AWK and grep (2018-10-24)
     Showing off the fastest way to search a text file for strings in another file

How to validate ISO 8601 dates without regex (2018-10-05)
     Check for format and content errors in YYYY-MM-DD fields with AWK

Fightin' fields (2018-09-30)
     Finding disagreements between data fields can be challenging

Fuzzy matching in practice (2018-09-23)
     Tips for approximate matching with tre-agrep

48 sea levels and a trope for your terminal (2018-08-11)
     A bulk string replacement with AWK, and that ACCESS DENIED thing

Pseudo-blank ("empty") records and fields (2018-08-04)
     How to find not-quite-empty rows and columns in a data table

Time series ops (2018-07-23)
     Using AWK to summarise time series data

Partial duplicates (2018-07-14)
     One way to find "pseudoduplicated" records

Truncated data items (2018-07-04)
     Detecting truncations, such as a 100-character string clipped to 50 characters

Compare parts of strings (2018-05-22)
     How to use AWK's "split" function to compare parts of strings


Characters and encoding

Character equivalence classes 2: the nature of equivalence (2020-06-24)
     What does "something like" actually mean?

Character equivalence classes 1: search and replace (2020-06-17)
     How to find "something like" a character

More mojibake fun (2020-04-01)
     Easy-to-hard examples of translating from gibberish

Hunting gremlins (2020-01-22)
     A script to make invisible gremlin characters visible

Build your own character class inventories — updated (2019-12-27)
     Find out what [:alpha:] and [:cntrl:] mean in your system

Introducing the replo (2019-11-01)
     Character replacements by computers can be reversible, reconstructable or researchable

An unexpected character replacement (2019-10-18)
     Strange replacements of non-ASCII characters by R

Return of the mojibake detective (2019-07-05)
     Three new cases of mysterious character corruptions

Quotes as characters (2019-04-07; updated 2019-05-26)
     How to recognise the nine different kinds of single and double quotes

How to choose special characters, revisited (2019-03-24)
     Scripting a little GUI for copying/pasting your most often-used special characters

iconv and illegal input sequences (2018-09-13)
     Getting around a roadblock in changing the character encoding of a file

SCI and 62;c62;c62;c... (2018-08-25)
     A control character causes strange behaviour in GUI terminals

Mojibake detective work (2018-08-06)
     A close look at some character encoding problems

Question marks that aren't really question marks (2018-07-27)
     Some question marks show that a program doesn't understand a character's encoding

Combo characters (2018-06-09)
     How to deal with Unicode's combining characters


Data formatting

Spotting spaces, and AWK's view of emptiness (2020-09-09)
     A simple way to show and count plain whitespaces,
     and "non-empty" vs "non-empty and non-zero" in AWK

How to number copy/pasted commands (2020-08-05)
     A neat way to number and indent commands and their outputs

Sharing data and metadata together (2020-07-29)
     How not to lose a data table's metadata

A quick repair job on a dislocated table (2020-07-15)
     Fixing a table with displaced fields

Extra commas in a CSV (2020-07-08)
     How to safely delete just the excess commas

Join consecutive lines if condition applies (2020-06-03)
     Simple ways to fix embedded newlines

Printing repeats within repeats, and splitting a list into columns (2020-05-27)
     Why I use pr rather than column for some columnating jobs

How to move selected lines within a file (2020-05-13)
     No need to cut and paste, use the command line

Dealing with an all-CAPS/first-CAP jumble (2020-04-29)
     How to normalise a mix of WORDS and Words

How to be uncertain with dates (2020-02-12)
     A skeptical look at some of ISO 8601's new extensions

JSON Lines: record-style JSON (2020-01-29)
     A bridge between table-style data and standard JSON

Emphasising text in the terminal (2019-12-13)
     Making selected strings stand out with ANSI codes

Embedded newlines without a clue (2019-11-15)
     Without clear markers for field fragments, you need to be creative

Add leading zeroes that aren't really leading (2019-09-13)
     How to format numbers when they're inside non-numeric strings

A GUI to re-order fields in a table (2019-08-30)
     A shell script for building a new table with reordered fields

The lat/lon floating point delusion (2019-08-09)
     That big building is at -33.8903169365705 151.198409720645? Really?

Renumber a list after inserting a line — updated (2019-07-27)
     A handy function for inserting and renumbering

Data from dingbats: copying down (2019-02-24)
     Copying down is easy in a spreadsheet, but it's also possible on the command line

Fancy numbering of records (2019-02-17)
     On the command line, you can number a list of records any way you like

Reformatting a list, cleverly (2019-01-27)
     Create horizontal lists from a vertical one

Horizontal sorting within a field (2019-01-13)
     There are two different ways to sort a field "horizontally", but neither of them is simple.

Changing the month format: a fairly general solution (2018-12-30)
     Build a look-up table and use the starting and finishing format in an AWK command

Putting information into a table from the table's filename (2018-12-13)
     The example adds a date from the filename to each record in the table

Unwrap your fasta (2018-12-01)
     How to concatenate the sequence lines in FASTA files

Repair job: separate the tandem repeats (2018-10-26)
     How to split a tandem repeat between fields

Too many lat/lon digits (2018-06-30)
     Rounding off latitude/longitude data to an appropriate number of significant figures

Embedded newlines (2018-06-23)
     How to safely remove embedded newlines


Data analysis examples

What's wrong with these records? (2020-08-26)
     Tinkering with "present in these records, absent in those"

Checking date components across fields (2020-04-15)
     Does "date" agree with "year", "month" and "day"?

Life tables (2020-03-11)
     A sober look at the probability of dying in Australia

Data quality in iNaturalist downloads (2020-02-05)
     Top marks for data from the citizen-science iNaturalist project

Steady as she goes, in Darwin (2019-10-25)
     The daily temperatures in Darwin (Australia) are remarkably constant

Two ugly CSVs (2019-04-28)
     Open but messy data from the Australian Electoral Commission and Companies House

Dog and cat data (2019-03-31)
     A command-line exploration of five public datasets

Data with bulges (2019-03-10)
     Three cases of unexpectedly large values in a data item

Two special data validations (2019-03-03)
     Is that tree correctly located? Is that list of names and addresses truly regular?

Drugs on the command line (2019-01-06)
     A disappointing dive into drugs data from the US Food and Drug Administration

Has the rainfall pattern in my hometown changed? (2018-12-23)
     No obvious trends in number, length or intensity of rainfall events in recent years

Fun with BOM data (2018-07-11)
     Weather watching with wget and gnuplot

Pivoting airlines (2018-06-03)
     Using arrays of arrays to build a pivot table with AWK


AWK tips and tricks

The easy-going syntax of AWK commands (2020-02-26)
     AWK is flexible and tolerant in its command rules

Another surprising AWK trick (2019-12-06)
     Strings or numbers? It depends on what you're doing with them.

A muggle's guide to AWK arrays: 4 (2019-09-20)
     Easier and more flexible ways to sort array outputs

A muggle's guide to AWK arrays: 3 (2019-08-23)
     Reformatting and table joining using arrays

A muggle's guide to AWK arrays: 2 (2019-07-12)
     Working with two files, or the same file twice

A muggle's guide to AWK arrays: 1 (2019-06-07)
     Array naming, index strings and value strings

A surprising AWK trick (2018-05-27)
     A clever way to avoid using a flag in AWK


BASH tips and tricks

How to bookmark directories in the shell (2020-06-10)
     A couple of functions is all it takes

Brace expansion with variables and arrays: eval to the rescue (2020-04-22)
     eval, a BASH built-in, solves brace expansion problems

Getting around a subshell problem (2020-01-15)
     Something strange happens with buffering in a subshell

Working around the BASH brace expansion rule (2019-06-14)
     How to build Cartesian string products in BASH

The magic of BASH string expansion (2019-05-19)
     A simple trick that allows AWK and sed to use BASH as an interpreter

Avoiding senior moments with command-line functions (2018-11-13)
     The trick is to make the documentation available on the CLI


Useful programs for command-line data ops

VisiData: a table explorer for the terminal (2019-10-11)
     Display, sort, reformat and more with this CLI utility

Transpose, pivot and bin with GNU Datamash 1.4 (2019-05-24)
     Do complex data transformations more easily with Datamash

Parsing scientific names (gnparser) (2019-01-20)
     Scientific names are much harder to parse than personal names


Data entry and display

Changing TTY prompt, font and colors (2020-02-19)
     How to prettify your virtual terminals

Data validation on entry with YAD (2019-11-29)
     In praise of lookup lists for data entry, with help from YAD dialogs

Plotting data in the terminal with gnuplot (2019-06-21)
     A separate graphic is much better than an in-terminal plot

Making pictures with data (2019-04-14)
     How to display data bytes as image bytes

Mapping with gnuplot (2018-10-31)
     How to use gnuplot to put data points on a basemap

How to enter nothing in a database (2018-10-18)
     If you have nothing to say, say nothing

Displaying data from table fragments (2018-09-06)
     One way to build a tidy table from a jumble of data

A record pager built with YAD (2018-08-18; updated 2018-09-09)
     How to turn a YAD dialog into a GUI viewer/pager for records in a data table

GUI ways to view and edit big text files (2018-07-31)
     glogg, gvim, Geany and csvpad, but not spreadsheets

YAD repeat and edit (2018-05-21)
     How to avoid re-entering data in a YAD data entry form


The Windows and spreadsheet worlds

A grizzle about captive data (2020-07-22)
     Don't confuse data with the Windows software that contains it

Spreadsheet annoyance no. 2 (2019-04-21)
     Spreadsheets make dates out of entries that aren't dates, but that's not all they mess up

The trouble with Windows CRLF (2019-03-17)
     Windows line endings are in a pain in the ... terminal

Getting data out of Excel safely (2019-02-10)
     Watch out for embedded linebreaks, comma problems and character encoding issues

Curse of the CSV monster (2018-07-18)
     CSV with broken records to TSV


Miscellaneous stuff

A data table thousands of years old (2020-08-12)
     Modern record-keeping in ancient Mesopotamia

Second Tuesday of each month and a BASHing data century (2020-03-25)
     ncal and the 100th blog post

Msot popele can undreatnsd tihs setennce (2019-12-20)
     Garbling and ungarbling with shell scripts

Python and shell tools (2019-11-22)
     A comparison of three data operations

A command-line "Countdown" (UK) companion (2019-09-27)
     Fast solver for anagram puzzles, and a puzzle generator

Getting data from an Enphase Envoy S (2019-09-06)
     Two user-accessible JSON files with performance data

Data on clay (2018-09-20)
     Cheap data storage for thousands of years? Check.
     Ancient glyphs in your terminal? Check.