
About this blog
BASHing data is a companion blog to A Data Cleaner's Cookbook. This is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing etc.
The first 139 BASHing data posts (to 2020-12-23) and version 2 of A Data Cleaner's Cookbook have been archived in Zenodo and can be downloaded for offline use. Links between the blog and the Cookbook are all local in the archived versions, so you can use both resources without needing to go online.
Want to comment?
Email me. If your remarks are on-topic and helpful, they'll be edited straight into the relevant post, not buried in a list of comments at the bottom of the webpage.
Want notice of future posts?
Copy the RSS link into your feed reader:
About me
I'm a data auditor and retired zoologist.
Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Topic categories:
- Data auditing, cleaning and processing
- Characters and encoding
- Data formatting
- Data analysis examples
- AWK tips and tricks
- BASH tips and tricks
- Useful programs for command-line data ops
- Data entry and display
- The Windows and spreadsheet worlds
- Miscellaneous stuff
Most recent post:
Four kinds of data anomalies (2021-02-24)
Anomalies might be out of range, out of place, out of match or out of date
Posts by category (most recent post first):
Data auditing, cleaning and processing
Four kinds of data anomalies (2021-02-24)
Anomalies might be out of range, out of place, out of match or out of date
How to find the missing parts of a series (2021-02-03)
Command-line solutions for a simple and two more complicated cases
How to build a multi-file fields concordance (2020-12-23)
Clearly show which fields have the same name in two or more files
Check the day of year, given a date (2020-11-18)
Comparing ISO 8601 dates with their day numbers
How to keep an eye on field numbers (2020-11-04)
Put the field numbers on a digital Post-it note with YAD
Three kangaroos in the ocean (2020-09-30)
Ridiculous outliers can sometimes be worth salvaging
Finding one-to-many entries in a data table (2020-09-16)
Too many B's for each A?
Checking DIY primary/foreign key relationships (2020-09-02)
Problems when primary and foreign keys are hand-built
How to do a both/neither/one/other tally - updated (2020-09-06)
A simple check on paired fields (like latitude and longitude) in a data table
How to find almost-duplicates (2020-07-01)
Two methods that work with some (but not all) data tables
Add an issues field to a data table (2020-05-20)
How to get records to self-report their problems
Spellchecking scientific names on the command line (2020-05-06)
How to build and use a dictionary of scientific names
Targeted string replacements with sed and AWK (2020-04-08)
Avoid the dangers in globally replacing A with B
A curious pair of data ops (2020-03-18)
Multiple pivots and keying the unreadable
Moving averages with AWK (2020-03-04)
A command for adding moving averages to a table
Topping and tailing, and the slowness of GNU sort — updated (2019-11-08)
GNU sort can be a rate-limiting step in a pipeline
How to guess the field separator in a table (2019-10-04)
Count up the likely field separators in the header line with AWK
Long, narrow tables vs short, wide ones (2019-08-16)
Three tests of processing speed show that table shape doesn't matter
A bulk replacement GUI with YAD (2019-08-02)
A shell script for "normalising" pseudo-duplicates in a data table
Finding malformed markup (2019-07-19)
How to identify messed-up HTML tags in non-HTML documents
Leading and trailing whitespace (2019-06-28)
How to find and delete "fore and aft" whitespace within fields in a data table
Growing the Cookbook's "broken" function (2019-05-31)
A more informative way to tally up the number of fields in a data table
How to delete, insert and replace whole lines (2019-05-12)
Use line addresses to target just the right lines
How to delete, insert and replace whole fields (2019-05-05)
Cut and paste are usually the right tools for these jobs
Comparing fields across two tables (2019-02-03; updated)
A script to check for changes in a field
How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons? (2018-12-16)
On the command line, you can ignore everything but the numbers
Finding changepoints in a list, revisited (2018-12-06)
Using AWK to find where values change in a list
How to find distances between lat/lons for geochecking (2018-11-07)
When you're looking for big differences, an approximate method is fine
Bird watching with AWK and grep (2018-10-24)
Showing off the fastest way to search a text file for strings in another file
How to validate ISO 8601 dates without regex (2018-10-05)
Check for format and content errors in YYYY-MM-DD fields with AWK
Fightin' fields (2018-09-30)
Finding disagreements between data fields can be challenging
Fuzzy matching in practice (2018-09-23)
Tips for approximate matching with tre-agrep
48 sea levels and a trope for your terminal (2018-08-11)
A bulk string replacement with AWK, and that ACCESS DENIED thing
Pseudo-blank ("empty") records and fields (2018-08-04)
How to find not-quite-empty rows and columns in a data table
Time series ops (2018-07-23)
Using AWK to summarise time series data
Partial duplicates (2018-07-14)
One way to find "pseudoduplicated" records
Truncated data items (2018-07-04)
Detecting truncations, such as a 100-character string clipped to 50 characters
Compare parts of strings (2018-05-22)
How to use AWK's "split" function to compare parts of strings
Characters and encoding
Mojibake bonanza (2020-12-16)
New mojibake origin puzzles from a museum database
Encoding detection smackdown (2020-09-23)
enca vs file vs iconv vs isutf8 vs uchardet
Character equivalence classes 2: the nature of equivalence (2020-06-24)
What does "something like" actually mean?
Character equivalence classes 1: search and replace (2020-06-17)
How to find "something like" a character
More mojibake fun (2020-04-01)
Easy-to-hard examples of translating from gibberish
Hunting gremlins (2020-01-22)
A script to make invisible gremlin characters visible
Build your own character class inventories — updated (2019-12-27)
Find out what [:alpha:] and [:cntrl:] mean in your system
Introducing the replo (2019-11-01)
Character replacements by computers can be reversible, reconstructable or researchable
An unexpected character replacement (2019-10-18)
Strange replacements of non-ASCII characters by R
Return of the mojibake detective (2019-07-05)
Three new cases of mysterious character corruptions
Quotes as characters (2019-04-07; updated 2019-05-26)
How to recognise the nine different kinds of single and double quotes
How to choose special characters, revisited (2019-03-24)
Scripting a little GUI for copying/pasting your most often-used special characters
iconv and illegal input sequences (2018-09-13)
Getting around a roadblock in changing the character encoding of a file
SCI and 62;c62;c62;c... (2018-08-25)
A control character causes strange behaviour in GUI terminals
Mojibake detective work (2018-08-06)
A close look at some character encoding problems
Question marks that aren't really question marks (2018-07-27)
Some question marks show that a program doesn't understand a character's encoding
Combo characters (2018-06-09)
How to deal with Unicode's combining characters
Data formatting
Converting a list to a presence/absence table (2021-02-10)
Re-formatting is easy with tidy, well-structured data
ASCII score bars and a gorblimey command (2021-01-27)
How to build a string of characters and their complement
Form text and placeholders (2021-01-13)
Form letters, diaries and mail merge in plain text
Comparing strings more clearly (2020-12-09)
How to make and emphasise a string comparison between fields
Re-format blah,YYYYMMDD,blah as blah,YYYY,MM,DD,blah (2020-12-02)
How to do it with sed or AWK: 7 methods
How to stack columns (2020-11-25)
Turn a "columnated" table into a straight up-and-down one
Building a data table from a sentence (2020-10-07)
How to expand a condensed data structure
Spotting spaces, and AWK's view of emptiness (2020-09-09)
A simple way to show and count plain whitespaces, and "non-empty" vs "non-empty and non-zero" in AWK
How to number copy/pasted commands (2020-08-05)
A neat way to number and indent commands and their outputs
Sharing data and metadata together (2020-07-29)
How not to lose a data table's metadata
A quick repair job on a dislocated table (2020-07-15)
Fixing a table with displaced fields
Extra commas in a CSV (2020-07-08)
How to safely delete just the excess commas
Join consecutive lines if condition applies (2020-06-03)
Simple ways to fix embedded newlines
Printing repeats within repeats, and splitting a list into columns (2020-05-27)
Why I use pr rather than column for some columnating jobs
How to move selected lines within a file (2020-05-13)
No need to cut and paste, use the command line
Dealing with an all-CAPS/first-CAP jumble (2020-04-29)
How to normalise a mix of WORDS and Words
How to be uncertain with dates (2020-02-12)
A skeptical look at some of ISO 8601's new extensions
JSON Lines: record-style JSON (2020-01-29)
A bridge between table-style data and standard JSON
Emphasising text in the terminal (2019-12-13)
Making selected strings stand out with ANSI codes
Embedded newlines without a clue (2019-11-15)
Without clear markers for field fragments, you need to be creative
Add leading zeroes that aren't really leading (2019-09-13)
How to format numbers when they're inside non-numeric strings
A GUI to re-order fields in a table (2019-08-30)
A shell script for building a new table with reordered fields
The lat/lon floating point delusion (2019-08-09)
That big building is at -33.8903169365705 151.198409720645? Really?
Renumber a list after inserting a line — updated (2019-07-27)
A handy function for inserting and renumbering
Data from dingbats: copying down (2019-02-24)
Copying down is easy in a spreadsheet, but it's also possible on the command line
Fancy numbering of records (2019-02-17)
On the command line, you can number a list of records any way you like
Reformatting a list, cleverly (2019-01-27)
Create horizontal lists from a vertical one
Horizontal sorting within a field (2019-01-13)
There are two different ways to sort a field "horizontally", but neither of them is simple
Changing the month format: a fairly general solution (2018-12-30)
Build a look-up table and use the starting and finishing format in an AWK command
Putting information into a table from the table's filename (2018-12-13)
The example adds a date from the filename to each record in the table
Unwrap your fasta (2018-12-01)
How to concatenate the sequence lines in FASTA files
Repair job: separate the tandem repeats (2018-10-26)
How to split a tandem repeat between fields
Too many lat/lon digits (2018-06-30)
Rounding off latitude/longitude data to an appropriate number of significant figures
Embedded newlines (2018-06-23)
How to safely remove embedded newlines
Data analysis examples
The myth of equinoctial gales (2020-10-14)
Real-world wind data don't show equinoctial gales
What's wrong with these records? (2020-08-26)
Tinkering with "present in these records, absent in those"
Checking date components across fields (2020-04-15)
Does "date" agree with "year", "month" and "day"?
Life tables (2020-03-11)
A sober look at the probability of dying in Australia
Data quality in iNaturalist downloads (2020-02-05)
Top marks for data from the citizen-science iNaturalist project
Steady as she goes, in Darwin (2019-10-25)
The daily temperatures in Darwin (Australia) are remarkably constant
Two ugly CSVs (2019-04-28)
Open but messy data from the Australian Electoral Commission and Companies House
Dog and cat data (2019-03-31)
A command-line exploration of five public datasets
Data with bulges (2019-03-10)
Three cases of unexpectedly large values in a data item
Two special data validations (2019-03-03)
Is that tree correctly located? Is that list of names and addresses truly regular?
Drugs on the command line (2019-01-06)
A disappointing dive into drugs data from the US Food and Drug Administration
Has the rainfall pattern in my hometown changed? (2018-12-23)
No obvious trends in number, length or intensity of rainfall events in recent years
Fun with BOM data (2018-07-11)
Weather watching with wget and gnuplot
Pivoting airlines (2018-06-03)
Using arrays of arrays to build a pivot table with AWK
AWK tips and tricks
Updating a file from a lookup table (2020-11-11)
How to use an AWK array for lookup operations
How to use flags in AWK (revisited) (2020-10-21)
Flags are handy for defining AWK's working range of records
The easy-going syntax of AWK commands (2020-02-26)
AWK is flexible and tolerant in its command rules
Another surprising AWK trick (2019-12-06)
Strings or numbers? It depends on what you're doing with them.
A muggle's guide to AWK arrays: 4 (2019-09-20)
Easier and more flexible ways to sort array outputs
A muggle's guide to AWK arrays: 3 (2019-08-23)
Reformatting and table joining using arrays
A muggle's guide to AWK arrays: 2 (2019-07-12)
Working with two files, or the same file twice
A muggle's guide to AWK arrays: 1 (2019-06-07)
Array naming, index strings and value strings
A surprising AWK trick (2018-05-27)
A clever way to avoid using a flag in AWK
BASH tips and tricks
How to bookmark directories in the shell (2020-06-10)
A couple of functions is all it takes
Brace expansion with variables and arrays: eval to the rescue (2020-04-22)
eval, a BASH built-in, solves brace expansion problems
Getting around a subshell problem (2020-01-15)
Something strange happens with buffering in a subshell
Working around the BASH brace expansion rule (2019-06-14)
How to build Cartesian string products in BASH
The magic of BASH string expansion (2019-05-19)
A simple trick that allows AWK and sed to use BASH as an interpreter
Avoiding senior moments with command-line functions (2018-11-13)
The trick is to make the documentation available on the CLI
Useful programs for command-line data ops
VisiData: a table explorer for the terminal (2019-10-11)
Display, sort, reformat and more with this CLI utility
Transpose, pivot and bin with GNU Datamash 1.4 (2019-05-24)
Do complex data transformations more easily with Datamash
Parsing scientific names (gnparser) (2019-01-20)
Scientific names are much harder to parse than personal names
Data entry and display
A sunset surprise (2021-02-17)
Data graphics help to explain a puzzling phenomenon
Changing TTY prompt, font and colors (2020-02-19)
How to prettify your virtual terminals
Data validation on entry with YAD (2019-11-29)
In praise of lookup lists for data entry, with help from YAD dialogs
Plotting data in the terminal with gnuplot (2019-06-21)
A separate graphic is much better than an in-terminal plot
Making pictures with data (2019-04-14)
How to display data bytes as image bytes
Mapping with gnuplot (2018-10-31)
How to use gnuplot to put data points on a basemap
How to enter nothing in a database (2018-10-18)
If you have nothing to say, say nothing
Displaying data from table fragments (2018-09-06)
One way to build a tidy table from a jumble of data
A record pager built with YAD (2018-08-18; updated 2018-09-09)
How to turn a YAD dialog into a GUI viewer/pager for records in a data table
GUI ways to view and edit big text files (2018-07-31)
glogg, gvim, Geany and csvpad, but not spreadsheets
YAD repeat and edit (2018-05-21)
How to avoid re-entering data in a YAD data entry form
The Windows and spreadsheet worlds
Spreadsheet annoyance no. 3: quotes have priority (2021-01-20)
Beware of unmatched quotes in data items
A grizzle about captive data (2020-07-22)
Don't confuse data with the Windows software that contains it
Spreadsheet annoyance no. 2 (2019-04-21)
Spreadsheets make dates out of entries that aren't dates, but that's not all they mess up
The trouble with Windows CRLF (2019-03-17)
Windows line endings are a pain in the ... terminal
Getting data out of Excel safely (2019-02-10)
Watch out for embedded linebreaks, comma problems and character encoding issues
Curse of the CSV monster (2018-07-18)
CSV with broken records to TSV
Miscellaneous stuff
A short rant about Python, R and UNIX (2020-10-28)
Why would you clean data with Python or R?
A data table thousands of years old (2020-08-12)
Modern record-keeping in ancient Mesopotamia
Second Tuesday of each month and a BASHing data century (2020-03-25)
ncal and the 100th blog post
Msot popele can undreatnsd tihs setennce (2019-12-20)
Garbling and ungarbling with shell scripts
Python and shell tools (2019-11-22)
A comparison of three data operations
A command-line "Countdown" (UK) companion (2019-09-27)
Fast solver for anagram puzzles, and a puzzle generator
Getting data from an Enphase Envoy S (2019-09-06)
Two user-accessible JSON files with performance data
Data on clay (2018-09-20)
Cheap data storage for thousands of years? Check.
Ancient glyphs in your terminal? Check.