banner

For a list of BASHing data 2 blog posts see the index page.    RSS


A data table full of ghosts

It's a coincidence that this post was uploaded on Halloween and that the subject is ghosts. By "ghosts" I mean invisible characters (whitespaces) representing visible ones that have been deleted.

In a recent real-world case, the author of a data table had replaced ca 150,000 "NA" null values in the table's numerical fields with blanks. The "blanks" weren't empty strings, though. Instead the author had replaced each NA with a single whitespace. The NAs had disappeared, but their replacements were different from the genuine blanks in non-numerical fields in the table. The whitespaces were the "ghosts" of the NAs.

Here's a tab-separated demonstration file ("ghosty"):

fld1fld2fld3fld4fld5
I 3e1Csl9VUc3
4URY9f6IE Hk5
0DLI1yU5p94 H0FJ
1Sd nN346LcF7
9WQeG4O26DP6
E4 Q0SI 8X10Pa
uC03Rwc Y3
8Ja4bLP8pl1H

It looks OK in a terminal, but the spacevis function reveals that there's more than one kind of missing value in "ghosty":

reveal

The "cntghost" function uses AWK to count ghosts in a TSV. By setting "sum" to zero initially, it reports "0" if no ghosts are found:

cntghost() { awk -F"\t" 'BEGIN {sum=0} {for (i=1;i<=NF;i++) if ($i ~ /^[ ]+$/) sum++} END {print "There are "sum" ghosts in this table"}' "$1"; }

counter

AWK is told with -F"\t" that the table is tab-separated, and a BEGIN statement sets the value of the variable "sum" to 0 (BEGIN {sum=0} before any processing happens. AWK then works through each line field by field (for (i=1;i<=NF;i++)). If the data item in a field is only one or more whitespaces ( if ($i ~ /^[ ]+$/)), the "sum" variable is incremented by one (sum++). After all lines have been processed, AWK prints a statement with the "sum" variable within it (END {print "There are "sum" ghosts in this table"}).

AWK also makes a good ghost-remover:

awk -F"\t" '{for (i=1;i<=NF;i++) if ($i ~ /^[ ]+$/) $i=""}' OFS="\t"

AWK-remover

See the last AWK command explanation, above. If AWK finds a data item consisting of one or more whitespaces, it sets the empty string as the new content of that data item ($i=""). Since a field value has been changed, AWK needs to be told what the field separator is (OFS="\t"), and here I've done that after the command.

.

Next post:
2025-11-07   Not on the keyboard. How to type it?


Last update: 2025-10-31
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License