About this blog

BASHing data is a companion blog to A Data Cleaner's Cookbook. It continues a series of data-related posts I contributed from 2014 to 2018 to Andrew Powell's Linux Rain blog.

This is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing etc.

Want to comment?

Email me. If your remarks are on-topic and helpful, they'll be edited straight into the relevant post, not buried in a list of comments at the bottom of the webpage.

Want notice of future posts?

Copy the RSS link into your feed reader: RSS

About me

I'm a part-time data auditor and incompletely retired zoologist.

Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com


List of posts:

The magic of BASH string expansion (2019-05-19)
     A simple trick that allows AWK and sed to use BASH as an interpreter

How to delete, insert and replace whole lines (2019-05-12)
     Use line addresses to target just the right lines

How to delete, insert and replace whole fields (2019-05-05)
     Cut and paste are usually the right tools for these jobs

Two ugly CSVs (2019-04-28)
     Open but messy data from the Australian Electoral Commission and Companies House (UK)

Spreadsheet annoyance no. 2 (2019-04-21)
     Spreadsheets make dates out of entries that aren't dates, but that's not all they mess up

Making pictures with data (2019-04-14)
     How to display data bytes as image bytes

Quotes as characters (2019-04-07)
     How to recognise the nine different kinds of single and double quotes

Dog and cat data (2019-03-31)
     A command-line exploration of five public datasets

How to choose special characters, revisited (2019-03-24)
     Scripting a little GUI for copying/pasting your most often-used special characters

The trouble with Windows CRLF (2019-03-17)
     Windows line endings are a pain in the ... terminal

Data with bulges (2019-03-10)
     Three cases of unexpectedly large values in a data item

Two special data validations (2019-03-03)
     Is that tree correctly located? Is that list of names and addresses truly regular?

Data from dingbats: copying down (2019-02-24)
     Copying down is easy in a spreadsheet, but it's also possible on the command line

Fancy numbering of records (2019-02-17)
     On the command line, you can number a list of records any way you like

Getting data out of Excel safely (2019-02-10)
     Watch out for embedded linebreaks, comma problems and character encoding issues

Comparing fields across two tables (2019-02-03; updated)
     A script to check for changes in a field

Reformatting a list, cleverly (2019-01-27)
     Create horizontal lists from a vertical one

Parsing scientific names (2019-01-20)
     Scientific names are much harder to parse than personal names

Horizontal sorting within a field (2019-01-13)
     There are two different ways to sort a field "horizontally", but neither of them is simple

Drugs on the command line (2019-01-06)
     A disappointing dive into drugs data from the US Food and Drug Administration

Changing the month format: a fairly general solution (2018-12-30)
     Build a look-up table and use the starting and finishing format in an AWK command

Has the rainfall pattern in my hometown changed? (2018-12-23)
     No obvious trends in number, length or intensity of rainfall events in recent years

How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons? (2018-12-16)
     On the command line, you can ignore everything but the numbers

Putting information into a table from the table's filename (2018-12-13)
     The example adds a date from the filename to each record in the table

Finding changepoints in a list, revisited (2018-12-06)
     Using AWK to find where values change in a list

Unwrap your fasta (2018-12-01)
     How to concatenate the sequence lines in FASTA files

Avoiding senior moments with command-line functions (2018-11-13)
     The trick is to make the documentation available on the CLI

How to find distances between lat/lons for geochecking (2018-11-07)
     When you're looking for big differences, an approximate method is fine

Mapping with gnuplot (2018-10-31)
     How to use gnuplot to put data points on a basemap

Repair job: separate the tandem repeats (2018-10-26)
     How to split a tandem repeat between fields

Bird watching with AWK and grep (2018-10-24)
     Showing off the fastest way to search a text file for strings in another file

How to enter nothing in a database (2018-10-18)
     If you have nothing to say, say nothing

How to validate ISO 8601 dates without regex (2018-10-05)
     Check for format and content errors in YYYY-MM-DD fields with AWK

Fightin' fields (2018-09-30)
     Finding disagreements between data fields can be challenging

Fuzzy matching in practice (2018-09-23)
     Tips for approximate matching with tre-agrep

Data on clay (2018-09-20)
     Cheap data storage for thousands of years? Check. Ancient glyphs in your terminal? Check.

iconv and illegal input sequences (2018-09-13)
     Getting around a roadblock in changing the character encoding of a file

Displaying data from table fragments (2018-09-06)
     One way to build a tidy table from a jumble of data

SCI and 62;c62;c62;c... (2018-08-25)
     A control character causes strange behaviour in GUI terminals

A record pager built with YAD (2018-08-18; updated 2018-09-09)
     How to turn a YAD dialog into a GUI viewer/pager for records in a data table

48 sea levels and a trope for your terminal (2018-08-11)
     A bulk string replacement with AWK, and that ACCESS DENIED thing

Mojibake detective work (2018-08-06)
     A close look at some character encoding problems

Pseudo-blank ("empty") records and fields (2018-08-04)
     How to find not-quite-empty rows and columns in a data table

GUI ways to view and edit big text files (2018-07-31)
     glogg, gvim, Geany and csvpad, but not spreadsheets

Question marks that aren't really question marks (2018-07-27)
     Some question marks are signs that a program doesn't understand a character's encoding

Time series ops (2018-07-23)
     Using AWK to summarise time series data

Curse of the CSV monster (2018-07-18)
     How to convert a CSV with broken records to TSV

Partial duplicates (2018-07-14)
     One way to find "pseudoduplicated" records

Fun with BOM data (2018-07-11)
     Weather watching with wget and gnuplot

Truncated data items (2018-07-04)
     Detecting truncations, such as a 100-character string clipped to 50 characters in a database

Too many lat/lon digits (2018-06-30)
     Rounding off latitude/longitude data to an appropriate number of significant figures

Embedded newlines (2018-06-23)
     How to safely remove embedded newlines

Combo characters (2018-06-09)
     How to deal with Unicode's combining characters

Pivoting airlines (2018-06-03)
     Using arrays of arrays to build a pivot table with AWK

A surprising AWK trick (2018-05-27)
     A clever way to avoid using a flag in AWK

Compare parts of strings (2018-05-22)
     How to use AWK's "split" function to compare parts of strings

YAD repeat and edit (2018-05-21)
     How to avoid re-entering data in a YAD data entry form