For a list of BASHing data 2 blog posts see the index page. ![]()
A data table full of ghosts
It's a coincidence that this post was uploaded on Halloween and that the subject is ghosts. By "ghosts" I mean invisible characters (whitespaces) representing visible ones that have been deleted.
In a recent real-world case, the author of a data table had replaced ca 150,000 "NA" null values in the table's numerical fields with blanks. The "blanks" weren't empty strings, though. Instead the author had replaced each NA with a single whitespace. The NAs had disappeared, but their replacements were different from the genuine blanks in non-numerical fields in the table. The whitespaces were the "ghosts" of the NAs.
Here's a tab-separated demonstration file ("ghosty"):
| fld1 | fld2 | fld3 | fld4 | fld5 |
| I 3e | 1Cs | l9V | Uc3 | |
| 4UR | Y9f | 6IE | Hk5 | |
| 0DL | I1y | U5p | 94 H | 0FJ |
| 1Sd | nN3 | 46L | cF7 | |
| 9WQ | eG4 | O26 | DP6 | |
| E4 Q | 0SI | 8X1 | 0Pa | |
| uC0 | 3Rw | c Y3 | ||
| 8Ja | 4bL | P8p | l1H |
It looks OK in a terminal, but the spacevis function reveals that there's more than one kind of missing value in "ghosty":
The "cntghost" function uses AWK to count ghosts in a TSV. By setting "sum" to zero initially, it reports "0" if no ghosts are found:
cntghost() { awk -F"\t" 'BEGIN {sum=0} {for (i=1;i<=NF;i++) if ($i ~ /^[ ]+$/) sum++} END {print "There are "sum" ghosts in this table"}' "$1"; }
AWK is told with -F"\t" that the table is tab-separated, and a BEGIN statement sets the value of the variable "sum" to 0 (BEGIN {sum=0} before any processing happens. AWK then works through each line field by field (for (i=1;i<=NF;i++)). If the data item in a field is only one or more whitespaces ( if ($i ~ /^[ ]+$/)), the "sum" variable is incremented by one (sum++). After all lines have been processed, AWK prints a statement with the "sum" variable within it (END {print "There are "sum" ghosts in this table"}).
AWK also makes a good ghost-remover:
awk -F"\t" '{for (i=1;i<=NF;i++) if ($i ~ /^[ ]+$/) $i=""}' OFS="\t"
See the last AWK command explanation, above. If AWK finds a data item consisting of one or more whitespaces, it sets the empty string as the new content of that data item ($i=""). Since a field value has been changed, AWK needs to be told what the field separator is (OFS="\t"), and here I've done that after the command.
.Next post:
2025-11-07 Not on the keyboard. How to type it?
Last update: 2025-10-31
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License