banner

About this blog

BASHing data is a companion blog to A Data Cleaner's Cookbook. It continues a series of data-related posts I contributed from 2014 to 2018 to Andrew Powell's Linux Rain blog.

This is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing etc.

Want to comment?

Email me. If your remarks are on-topic and helpful, they'll be edited straight into the relevant post, not buried in a list of comments at the bottom of the webpage.

Want notice of future posts?

Copy the RSS link into your feed reader:    RSS

About me

I'm a data auditor and retired zoologist.

Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com

The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Most recent post:

Embedded newlines without a clue (2019-11-15)
     Without clear markers for field fragments, you need to be creative



Posts by category (most recent post first):

Data auditing, cleaning and processing

Topping and tailing, and the slowness of GNU sort (2019-11-08)
     GNU sort can be a rate-limiting step in a pipeline

How to guess the field separator in a table (2019-10-04)
     Count up the likely field separators in the header line with AWK

Long, narrow tables vs short, wide ones (2019-08-16)
     Three tests of processing speed show that table shape doesn't matter

A bulk replacement GUI with YAD (2019-08-02)
     A shell script for "normalising" pseudo-duplicates in a data table

Finding malformed markup (2019-07-19)
     How to identify messed-up HTML tags in non-HTML documents

Leading and trailing whitespace (2019-06-28)
     How to find and delete "fore and aft" whitespace within fields in a data table

Growing the Cookbook's "broken" function (2019-05-31)
     A more informative way to tally up the number of fields in a data table

How to delete, insert and replace whole lines (2019-05-12)
     Use line addresses to target just the right lines

How to delete, insert and replace whole fields (2019-05-05)
     Cut and paste are usually the right tools for these jobs

Comparing fields across two tables (2019-02-03; updated)
     A script to check for changes in a field

How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons? (2018-12-16)
     On the command line, you can ignore everything but the numbers

Finding changepoints in a list, revisited (2018-12-06)
     Using AWK to find where values change in a list

How to find distances between lat/lons for geochecking (2018-11-07)
     When you're looking for big differences, an approximate method is fine

Bird watching with AWK and grep (2018-10-24)
     Showing off the fastest way to search a text file for strings in another file

How to validate ISO 8601 dates without regex (2018-10-05)
     Check for format and content errors in YYYY-MM-DD fields with AWK

Fightin' fields (2018-09-30)
     Finding disagreements between data fields can be challenging

Fuzzy matching in practice (2018-09-23)
     Tips for approximate matching with tre-agrep

48 sea levels and a trope for your terminal (2018-08-11)
     A bulk string replacement with AWK, and that ACCESS DENIED thing

Pseudo-blank ("empty") records and fields (2018-08-04)
     How to find not-quite-empty rows and columns in a data table

Time series ops (2018-07-23)
     Using AWK to summarise time series data

Partial duplicates (2018-07-14)
     One way to find "pseudoduplicated" records

Truncated data items (2018-07-04)
     Detecting truncations, such as a 100-character string clipped to 50 characters in a database

Compare parts of strings (2018-05-22)
     How to use AWK's "split" function to compare parts of strings


Characters and encoding

Introducing the replo (2019-11-01)
     Character replacements by computers can be reversible, reconstructable or researchable

An unexpected character replacement (2019-10-18)
     Strange replacements of non-ASCII characters by R

Return of the mojibake detective (2019-07-05)
     Three new cases of mysterious character corruptions

Quotes as characters (2019-04-07; updated 2019-05-26)
     How to recognise the nine different kinds of single and double quotes

How to choose special characters, revisited (2019-03-24)
     Scripting a little GUI for copying/pasting your most often-used special characters

iconv and illegal input sequences (2018-09-13)
     Getting around a roadblock in changing the character encoding of a file

SCI and 62;c62;c62;c... (2018-08-25)
     A control character causes strange behaviour in GUI terminals

Mojibake detective work (2018-08-06)
     A close look at some character encoding problems

Question marks that aren't really question marks (2018-07-27)
     Some question marks are signs that a program doesn't understand a character's encoding

Combo characters (2018-06-09)
     How to deal with Unicode's combining characters


Data formatting

Embedded newlines without a clue (2019-11-15)
     Without clear markers for field fragments, you need to be creative

Add leading zeroes that aren't really leading (2019-09-13)
     How to format numbers when they're inside non-numeric strings

A GUI to re-order fields in a table (2019-08-30)
     A shell script for building a new table with reordered fields

The lat/lon floating point delusion (2019-08-09)
     That big building is at -33.8903169365705 151.198409720645? Really?

Renumber a list after inserting a line — updated (2019-07-27)
     A handy function for inserting and renumbering

Data from dingbats: copying down (2019-02-24)
     Copying down is easy in a spreadsheet, but it's also possible on the command line

Fancy numbering of records (2019-02-17)
     On the command line, you can number a list of records any way you like

Reformatting a list, cleverly (2019-01-27)
     Create horizontal lists from a vertical one

Horizontal sorting within a field (2019-01-13)
     There are two different ways to sort a field "horizontally", but neither of them is simple.

Changing the month format: a fairly general solution (2018-12-30)
     Build a look-up table and use the starting and finishing format in an AWK command

Putting information into a table from the table's filename (2018-12-13)
     The example adds a date from the filename to each record in the table

Unwrap your fasta (2018-12-01)
     How to concatenate the sequence lines in FASTA files

Repair job: separate the tandem repeats (2018-10-26)
     How to split a tandem repeat between fields

Too many lat/lon digits (2018-06-30)
     Rounding off latitude/longitude data to an appropriate number of significant figures

Embedded newlines (2018-06-23)
     How to safely remove embedded newlines


Data analysis examples

Steady as she goes, in Darwin (2019-10-25)
     The daily temperatures in Darwin (Australia) are remarkably constant

Two ugly CSVs (2019-04-28)
     Open but messy data from the Australian Electoral Commission and Companies House (UK)

Dog and cat data (2019-03-31)
     A command-line exploration of five public datasets

Data with bulges (2019-03-10)
     Three cases of unexpectedly large values in a data item

Two special data validations (2019-03-03)
     Is that tree correctly located? Is that list of names and addresses truly regular?

Drugs on the command line (2019-01-06)
     A disappointing dive into drugs data from the US Food and Drug Administration

Has the rainfall pattern in my hometown changed? (2018-12-23)
     No obvious trends in number, length or intensity of rainfall events in recent years

Fun with BOM data (2018-07-11)
     Weather watching with wget and gnuplot

Pivoting airlines (2018-06-03)
     Using arrays of arrays to build a pivot table with AWK


AWK tips and tricks

A muggle's guide to AWK arrays: 4 (2019-09-20)
     Easier and more flexible ways to sort array outputs

A muggle's guide to AWK arrays: 3 (2019-08-23)
     Reformatting and table joining using arrays

A muggle's guide to AWK arrays: 2 (2019-07-12)
     Working with two files, or the same file twice

A muggle's guide to AWK arrays: 1 (2019-06-07)
     Array naming, index strings and value strings

A surprising AWK trick (2018-05-27)
     A clever way to avoid using a flag in AWK


BASH tips and tricks

Working around the BASH brace expansion rule (2019-06-14)
     How to build Cartesian string products in BASH

The magic of BASH string expansion (2019-05-19)
     A simple trick that allows AWK and sed to use BASH as an interpreter

Avoiding senior moments with command-line functions (2018-11-13)
     The trick is to make the documentation available on the CLI


Useful programs for command-line data ops

VisiData: a table explorer for the terminal (2019-10-11)
     Display, sort, reformat and more with this CLI utility

Transpose, pivot and bin with GNU Datamash 1.4 (2019-05-24)
     Do complex data transformations more easily with Datamash

Parsing scientific names (gnparser) (2019-01-20)
     Scientific names are much harder to parse than personal names


Data entry and display

Plotting data in the terminal with gnuplot (2019-06-21)
     A separate graphic is much better than an in-terminal plot

Making pictures with data (2019-04-14)
     How to display data bytes as image bytes

Mapping with gnuplot (2018-10-31)
     How to use gnuplot to put data points on a basemap

How to enter nothing in a database (2018-10-18)
     If you have nothing to say, say nothing

Displaying data from table fragments (2018-09-06)
     One way to build a tidy table from a jumble of data

A record pager built with YAD (2018-08-18; updated 2018-09-09)
     How to turn a YAD dialog into a GUI viewer/pager for records in a data table

GUI ways to view and edit big text files (2018-07-31)
     glogg, gvim, Geany and csvpad, but not spreadsheets

YAD repeat and edit (2018-05-21)
     How to avoid re-entering data in a YAD data entry form


The Windows and spreadsheet worlds

Spreadsheet annoyance no. 2 (2019-04-21)
     Spreadsheets make dates out of entries that aren't dates, but that's not all they mess up

The trouble with Windows CRLF (2019-03-17)
     Windows line endings are in a pain in the ... terminal

Getting data out of Excel safely (2019-02-10)
     Watch out for embedded linebreaks, comma problems and character encoding issues

Curse of the CSV monster (2018-07-18)
     CSV with broken records to TSV


Miscellaneous stuff

A command-line "Countdown" (UK) companion (2019-09-27)
     Fast solver for anagram puzzles, and a puzzle generator

Getting data from an Enphase Envoy S (2019-09-06)
     Two user-accessible JSON files with performance data

Data on clay (2018-09-20)
     Cheap data storage for thousands of years? Check. Ancient glyphs in your terminal? Check.